Main Question
Our main question:
- How does the inclusion of different numbers of age groups influence the results of an analysis of right to carry laws and violence rates?
Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
This case study will introduce the topic of multicolinearity. We will do so by showcasing a real world example where multicolinearity in part resulted in historically contriversial and conflicting findings about the influence of the adoption of right-to-carry (RTC) concealed handgun laws on violent crime rates in the United States.
We will focus on two articles:
This has been a controversial topic as many other articles also had conflicting results. See here for a list of studies.
The Donohue, et al. article discusses how there are many other important methodolical aspects besides multicolinearity that could account for the historically conflicting results in these previous papers.
In fact, nearly every aspect of the data analysis process was different between the Donohue, et al. analysis and the Lott and Mustard analysis.
However, we will focus particularly on multicolinearity and we will explore how it can influence linear regression analyses and result in different conclusions.
This analysis will demonstrate how methodological details can be critically influential for our overall conclusions and can result in important policy related consequences. This article will provide a basis for the motivation.
John J. Donohue et al., Right‐to‐Carry Laws and Violent Crime: A Comprehensive Assessment Using Panel Data and a State‐Level Synthetic Control Analysis. Journal of Empirical Legal Studies, 16,2 (2019).
David B. Mustard & John Lott. Crime, Deterrence, and Right-to-Carry Concealed Handguns. Coase-Sandor Institute for Law & Economics Working Paper No. 41, (1996).
Here you can see the differences in the data used in the featured RTC articles:
We will perform analyses similar to those in these articles, however we will not try to recreate them, instead we will simplify our analysis to allow us to focus on multicolinearity.
Therefore we will use a subset of the listed explanatory variables and they will be consistent for both analyses that we will perform, with the exception that one analysis will have 6 demographic variables like the analysis in the Donohue, et al. article and the other will have 36 demogrpahic variables like the analysis in the Lott and Mustard article.
Our main question:
Statistical Learning Objectives:
In this case study, students will learn:
1) what multicolinearity is and how it can influence linear regression coefficients
2) how to look for the presence of multicolinarity
3) the difference between multicolinearity and correlation
Data science Learning Objectives:
We will especially focus on using packages and functions from the Tidyverse, such as dplyr and ggplot2. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.
So what exactly is a right-to-carry law?
It is a law thatspecifies if and how citizens are allowed to have a firearm on their person or nearby (for example in the citizen’s car) in public.
The Second Amendment to the United States Constitution guarantees the right to “keep and bear arms”. The amendment was ratified in 1791 as part of the Bill of Rights.
However, there are no federal laws about carrying firearms in public.
These laws are created and enforced at the state level. Sates vary greatly in their laws about the right to carry firearms. Some require extensive effort to obtain a permit to legally carry a firearm, while other states require very minimal effort to legally carry a firearm.
According to Wikipedia about the history of right-to-carry policies in the United States:
Public perception on concealed carry vs open carry has largely flipped. In the early days of the United States, open carrying of firearms, long guns and revolvers was a common and well-accepted practice. Seeing guns carried openly was not considered to be any cause for alarm. Therefore, anyone who would carry a firearm but attempt to conceal it was considered to have something to hide, and presumed to be a criminal. For this reason, concealed carry was denounced as a detestable practice in the early days of the United States.
Concealed weapons bans were passed in Kentucky and Louisiana in 1813. (In those days open carry of weapons for self-defense was considered acceptable; concealed carry was denounced as the practice of criminals.) By 1859, Indiana, Tennessee, Virginia, Alabama, and Ohio had followed suit. By the end of the nineteenth century, similar laws were passed in places such as Texas, Florida, and Oklahoma, which protected some gun rights in their state constitutions. Before the mid 1900s, most U.S. states had passed concealed carry laws rather than banning weapons completely. Until the late 1990s, many Southern states were either “No-Issue” or “Restrictive May-Issue”. Since then, these states have largely enacted “Shall-Issue” licensing laws, with numerous states legalizing “Unrestricted concealed carry”.
See here for more information.
Here are the general categories of Right to Carry Laws:
You can see that none of the fifty states have no-issue laws currently, meaning that all states allow the right to carry firearms at least in some way, however the level of restrictions is dramatically different from one state to another.
Here you can see how these laws have changed over time around the country:
There is variation from state to state even within the same general category:
For example here are the current carry laws in Idaho which is considered an “Unrestricted - no permit required” state:
Idaho permits the open carrying of firearms.
Idaho law permits both residents and non-residents who are at least 18 years old to carry concealed weapons, without a carry license, outside the limits of or confines of any city, provided the person is not otherwise disqualified from being issued a license to carry.
A person may also carry concealed weapons on or about his or her person, without a license, in the person’s own place of abode or fixed place of business, on property in which the person has any ownership or leasehold interest, or on private property where the person has permission to carry from any person who has an ownership or leasehold interest in that property.
State law also allows any resident of Idaho or a current member of the armed forces of the United States to carry a concealed handgun without a license to carry, provided the person is over 18 years old and not disqualified from being issued a license to carry concealed weapons under state law. An amendment to state law that takes effect on July 1, 2020 changes the reference in the above law from “a resident of Idaho” to “any citizen of the United States.”
And here are the current carry laws in Arizona which is also considered an “Unrestricted- - no permit required” state:
Arizona respects the right of law abiding citizens to openly carry a handgun.
Any person 21 years of age or older, who is not prohibited possessor, may carry a weapon openly or concealed without the need for a license. Any person carrying without a license must acknowledge and comply with the demands of a law enforcement officer when asked if he/she is carrying a concealed deadly weapon, if the officer has initiated an “investigation” such as a traffic stop.
Notice that citizens in Idaho only need to be 18 to carry a firearm, whereas they must be 21 in Arizona.
In contrast here is an example of current carry laws in Maryland which is considered a “Rights Restricted-Very Limited Issue” state:
Carrying and Transportation in Vehicles It is unlawful for any person without a permit to wear or carry a handgun, openly or concealed, upon or about his person. It is also unlawful for any person to knowingly transport a handgun in any vehicle traveling on public roads, highways, waterways or airways, or upon roads or parking lots generally used by the public. This does not apply to any person wearing, carrying or transporting a handgun within the confines of real estate owned or leased by him, or on which he resides, or within the confines of a business establishment owned or leased by him.
Permit To Carry Application for a permit to carry a handgun is made to the Secretary of State Police. In addition to the printed application form, the applicant should submit a notarized letter stating the reasons why he is applying for a permit.
avocado….Right to carry and covid masks?
There are some important considerations regarding this data analysis to keep in mind:
We do not use all of the data used by either the Lott and Mustard or Donohue, et al. analyses, nor do we perform the same analysis of each article. We instead perform a much simpler analysis with less variables for the purposes of illustration of the concept of multicollinearity and its influence on regression coefficients, not to reproduce either analysis.
Because our analysis is an oversimplification, our analysis should not be used for determining policy changes, instead we suggest that users consult with a specialist.
We would also like to note that…AVOCADO It is important that we do not treat race as an objective measure. Despite this, it can be used to advance scientific inquiry. For more information on this topic, we have included a link to a paper on the use of race as a measure in epidemiology.
We will begin by loading the packages that we will need:
library(here)
library(readxl)
library(readr)
library(pdftools)
library(dplyr)
library(magrittr)
library(tidyr)
library(stringr)
library(forcats)
library(car) # vif function
library(plm) # fixed effect model, linear regression
library(broom) # tidy output
library(tidyverse) # general wrangling functions
library(cowplot) # to produce plot of plots
library(GGally)
library(ggrepel)
library(scales)
library(latex2exp)
library(viridis)
library(ggcorrplot)
library(rsample)
set.seed(999)| Package | Use |
|---|---|
| here | to easily load and save data |
| readr | to import the CSV file data |
| [car] | to calculate vif values |
| [forcats] | to collapse levels of factors into more summarised versions |
The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
Below is a table from the Donohue, et al. paper that shows the data used in both analyses, where DAW stands for Donohue, et al. and LM stands for Lott and Mustard.
We will be using a subset of these variables, which are highlighted in green:
To obtain information about age, sex, and race, and overall population we will use US Census Bureau data, just like both of the articles. The cesnus data is available for different time spans. Here are the links for the years used in our analysis. We will use data from 1977 to 2010.
| Data | Link |
|---|---|
| years 1977 to 1979 | link |
| years 1980 to 1989 | link * county data was used for this decade which also has state information |
| years 1990 to 1999 | link |
| years 2000 to 2010 | link technical documentation |
To import the data we will use the read_csv() function of the readr package for the csv files. In some decades, there are separate files for each year, we will read each of these together using the base list.files() function to get all of the names for each file and then the map() function of the purrr package to apply the read_csv() function on all of the file paths in the list created by list.files(). For years that are txt files we will use read_table2() also fo the readr package. The read_table2() function, unlike the read_table(), allows for any number of whitespace characters between columns, and the lines can be of different lengths.
AVOCADO I am a bit confused about the last decade… it’s only one file but it seems to need map…
dem_77_79 <- read_csv("docs/Demographics/Decade_1970/pe-19.csv", skip = 5)
dem_80_89 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_1980/",
pattern = "*.csv",
full.names = TRUE) %>%
map(~read_csv(., skip=5))
dem_90_99 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_1990/",
pattern = "*.txt",
full.names = TRUE) %>%
map(~read_table2(., skip = 14))
dem_00_10_2 <- read_csv("docs/Demographics/Decade_2000/st-est00int-alldata.csv")
dem_00_10 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_2000/",
pattern = "*.csv",
full.names = TRUE) %>%
map(~read_csv(.))
head(dem_00_10)[[1]]
# A tibble: 62,244 x 21
REGION DIVISION STATE NAME SEX ORIGIN RACE AGEGRP ESTIMATESBASE20…
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 Unit… 0 0 0 0 281424600
2 0 0 0 Unit… 0 0 0 1 19176154
3 0 0 0 Unit… 0 0 0 2 20549855
4 0 0 0 Unit… 0 0 0 3 20528425
5 0 0 0 Unit… 0 0 0 4 20218782
6 0 0 0 Unit… 0 0 0 5 18962964
7 0 0 0 Unit… 0 0 0 6 19381792
8 0 0 0 Unit… 0 0 0 7 20511067
9 0 0 0 Unit… 0 0 0 8 22707390
10 0 0 0 Unit… 0 0 0 9 22442442
# … with 62,234 more rows, and 12 more variables: POPESTIMATE2000 <dbl>,
# POPESTIMATE2001 <dbl>, POPESTIMATE2002 <dbl>, POPESTIMATE2003 <dbl>,
# POPESTIMATE2004 <dbl>, POPESTIMATE2005 <dbl>, POPESTIMATE2006 <dbl>,
# POPESTIMATE2007 <dbl>, POPESTIMATE2008 <dbl>, POPESTIMATE2009 <dbl>,
# CENSUS2010POP <dbl>, POPESTIMATE2010 <dbl>
Notice that the STATE variable for the demographic data is numeric. That is because it is encoded by Federal Information Processing Standard (FIPS) state codes{target="_blank". Thus we also need to import data about FIPS encoding so that we can identify what data corresponds to what state.
The following data was downloaded from the US Census Bureau.
To import the data we will use the read_xls() function of the readxl package. Since the first five lines of this excel is information about the source of the data and when it was released, we need to skip importing these lines using the skip argument so that the data has the same number of columns for each row.
# A tibble: 64 x 4
Region Division `State\n(FIPS)` Name
<chr> <chr> <chr> <chr>
1 1 0 00 Northeast Region
2 1 1 00 New England Division
3 1 1 09 Connecticut
4 1 1 23 Maine
5 1 1 25 Massachusetts
6 1 1 33 New Hampshire
7 1 1 44 Rhode Island
8 1 1 50 Vermont
9 1 2 00 Middle Atlantic Division
10 1 2 34 New Jersey
# … with 54 more rows
The following data was downloaded from the Federal Bureau of Investigation.
The read_csv() function of the readr package guesses what the class is for each variable, but sometimes it makes mistakes. It is good to specify the class for variables if you know them. We know that we want the variables about male and female counts to be numeric. We can specify that using the col_types = argument. See here and here for more information.
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv")
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv",
col_types = cols(male_total_ct = "n",
female_total_ct = "n"))
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv",
col_types = cols(male_total_ct = col_double(),
female_total_ct = col_double()))
head(ps_data) # A tibble: 6 x 21
data_year ori pub_agency_name pub_agency_unit state_abbr division_name
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1960 AK02… Alcohol Bevera… <NA> AK Pacific
2 1960 AL00… Homewood <NA> AL East South C…
3 1960 AL01… Coffeeville <NA> AL East South C…
4 1960 AL01… Coffee <NA> AL East South C…
5 1960 AL02… Mentone <NA> AL East South C…
6 1960 AL03… Greensboro <NA> AL East South C…
# … with 15 more variables: region_name <chr>, county_name <chr>,
# agency_type_name <chr>, population_group_desc <chr>, population <dbl>,
# male_officer_ct <dbl>, male_civilian_ct <dbl>, male_total_ct <dbl>,
# female_officer_ct <lgl>, female_civilian_ct <lgl>, female_total_ct <dbl>,
# officer_ct <lgl>, civilian_ct <lgl>, total_pe_ct <lgl>,
# pe_ct_per_1000 <lgl>
The following data was downloaded from the U.S. Bureau of Labor Statistics.
There are excel files for each state. As you can see, there are many rows to skip to make sure that there are the same number of columns for each row. We can also see that the state name is located in a couple of the first rows.
We can also see that here if we just try to read in the files directly.
ue_rate_data <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(.))
head(ue_rate_data)[1][[1]]
# A tibble: 55 x 14
`Local Area Une… ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Original Data V… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 Series Id: LAUS… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 Not Seasonally … <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 Area: Alab… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 Area Type: Stat… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
7 State/Region/Di… Alab… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
8 Measure: unem… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
9 Years: 1977… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# … with 45 more rows, and 3 more variables: ...12 <chr>, ...13 <chr>,
# ...14 <chr>
So now we will skip the first 10 lines. And also create a names tibble that contains only the cell with the state information.
ue_rate_data <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(., skip = 10))
head(ue_rate_data[1])[[1]]
# A tibble: 44 x 14
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1977 7.5 9 7.7 7.2 6.8 8.6 8 7.8 6.7 6.3 6.3 6
2 1978 7.1 6.9 6.2 5.4 5.1 6.9 6.7 6.7 6.5 6.3 6.3 6.5
3 1979 6.7 7.5 6.9 6.6 6.4 8.4 7.7 7.8 7.1 7.2 6.9 6.7
4 1980 7.7 7.8 7.4 7.4 8.4 9.7 10.4 10.3 9.3 9.6 9.4 9
5 1981 10 10.3 9.5 9.1 9.4 11.1 10.4 10.9 10.8 11.7 11.5 11.8
6 1982 13.2 13.2 12.9 12.6 12.8 14.5 14.7 14.8 14.7 15.1 15.4 15.3
7 1983 16 16 14.5 13.7 13.3 14.6 13.9 13.8 13.2 12.8 12.1 11.8
8 1984 12.5 12.4 11.4 10.8 10.1 11.3 11.5 11.3 10.8 10.2 9.7 10.1
9 1985 10.7 10.5 9.8 8.7 8.4 9.6 9.2 8.8 8.6 8.6 8.4 8.7
10 1986 9.3 10.4 10.1 9.4 9.4 10.5 9.7 9.6 9.7 9.7 9.6 9
# … with 34 more rows, and 1 more variable: Annual <dbl>
To get the state name for each file using the map() function to perform functions across all of the files, we will specifically import only a small range of cells using the range = argument and then grab the cell that has state information based on it’s location within the range of cells imported using c() and then use the base unlist() function to unlist the list that this creates.
ue_rate_names <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(., range = "B4:B6")) %>%
map(., c(1,2)) %>%
unlist()
ue_rate_names [1] "Alabama" "Alaska" "Arizona"
[4] "Arkansas" "California" "Colorado"
[7] "Connecticut" "Delaware" "District of Columbia"
[10] "Florida" "Georgia" "Hawaii"
[13] "Idaho" "Illinois" "Indiana"
[16] "Iowa" "Kansas" "Kentucky"
[19] "Louisiana" "Maine" "Maryland"
[22] "Massachusetts" "Michigan" "Minnesota"
[25] "Mississippi" "Missouri" "Montana"
[28] "Nebraska" "Nevada" "New Hampshire"
[31] "New Jersey" "New Mexico" "New York"
[34] "North Carolina" "North Dakota" "Ohio"
[37] "Oklahoma" "Oregon" "Pennsylvania"
[40] "Rhode Island" "South Carolina" "South Dakota"
[43] "Tennessee" "Texas" "Utah"
[46] "Vermont" "Virginia" "Washington"
[49] "West Virginia" "Wisconsin" "Wyoming"
Extracted from Table 21 from US Census Bureau Poverty Data
AVOCado strange issue
#**persistent warning from unknown origin** https://community.rstudio.com/t/persistent-unknown-or-uninitialised-column-warnings/64879
#solution to above is alledgedly: "In any case the suggested approach is to initialize the column"
poverty_rate_data <- read_xls("docs/Poverty/hstpov21.xls", skip=2) #This may cause initialization issue, not easily reproducible (even after restarting R)
head(poverty_rate_data)# A tibble: 6 x 6
`NOTE: Number in thousa… ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 STATE Total Number "Standard\n… Percent "Standard\ner…
3 Alabama 4877 779 "65" 16 "1.3"
4 Alaska 720 94 "9" 13.1 "1.2"
5 Arizona 7241 929 "80" 12.80000000… "1.1000000000…
6 Arkansas 2912 462 "38" 15.9 "1.3"
We can see that this will require some wranlging to make the data more usable.
Violent crime data was obtained from here This data is a bit trickier because of spaces and / in the column names, thus the read_lines() function of the readr package works better than the read_csv() function.
crime_data <- read_lines("docs/Crime/CrimeStatebyState.csv", skip = 2, skip_empty_rows = TRUE)
head(crime_data)[1] "Estimated crime in Alabama"
[2] "\n,,National or state crime,,,,,,,"
[3] "\n,,Violent crime,,,,,,,"
[4] "\nYear,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
[5] "1977, 3690000, 15293, 524, 929,, 3572, 10268 "
[6] "1978, 3742000, 15682, 499, 954,, 3708, 10521 "
We can see that this data will also require some wranlging to make it more usable.
This data is extracted from table in Donohue paper {target="_blank"}. We will use the function pdf_text() of the pdftools package to import the pdf document.
if(!file.exists(here("docs", "w23510.pdf"))){
url <- "https://www.nber.org/papers/w23510.pdf"
utils::download.file(url, here("docs", "w23510.pdf"))
}
DAWpaper <- pdf_text(here("docs", "w23510.pdf"))
head(DAWpaper[1])[1] " NBER WORKING PAPER SERIES\n RIGHT-TO-CARRY LAWS AND VIOLENT CRIME:\n A COMPREHENSIVE ASSESSMENT USING PANEL DATA AND\n A STATE-LEVEL SYNTHETIC CONTROL ANALYSIS\n John J. Donohue\n Abhay Aneja\n Kyle D. Weber\n Working Paper 23510\n http://www.nber.org/papers/w23510\n NATIONAL BUREAU OF ECONOMIC RESEARCH\n 1050 Massachusetts Avenue\n Cambridge, MA 02138\n June 2017, Revised November 2018\nPreviously circulated as \"Right-to-Carry Laws and Violent Crime: A Comprehensive Assessment\nUsing Panel Data and a State-Level Synthetic Controls Analysis.\" We thank Dan Ho, Stefano\nDellaVigna, Rob Tibshirani, Trevor Hastie, StefanWager, Jeff Strnad, and participants at the\n2011 Conference of Empirical Legal Studies (CELS), 2012 American Law and Economics\nAssociation (ALEA) Annual Meeting, 2013 Canadian Law and Economics Association (CLEA)\nAnnual Meeting, 2015 NBER Summer Institute (Crime), and the Stanford Law School faculty\nworkshop for their comments and helpful suggestions. Financial support was provided by\nStanford Law School. We are indebted to Alberto Abadie, Alexis Diamond, and Jens\nHainmueller for their work developing the synthetic control algorithm and programming the Stata\nmodule used in this paper and for their helpful comments. The authors would also like to thank\nAlex Albright, Andrew Baker, Jacob Dorn, Bhargav Gopal, Crystal Huang, Mira Korb, Haksoo\nLee, Isaac Rabbani, Akshay Rao, Vikram Rao, Henrik Sachs and Sidharth Sah who provided\nexcellent research assistance, as well as Addis O’Connor and Alex Chekholko at the Research\nComputing division of Stanford’s Information Technology Services for their technical support.\nThe views expressed herein are those of the author and do not necessarily reflect the views of the\nNational Bureau of Economic Research.\nNBER working papers are circulated for discussion and comment purposes. They have not been\npeer-reviewed or been subject to the review by the NBER Board of Directors that accompanies\nofficial NBER publications.\n© 2017 by John J. Donohue, Abhay Aneja, and Kyle D. Weber. All rights reserved. Short\nsections of text, not to exceed two paragraphs, may be quoted without explicit permission\nprovided that full credit, including © notice, is given to the source.\n"
Again, this data will also require quite a bit of wrangling.
Let’s first take a look at our state FIPS data to see if it needs any cleaning or reshaping. We should start with this data, becuase we will need to use it to wrangle some of the other data.
# A tibble: 6 x 4
Region Division `State\n(FIPS)` Name
<chr> <chr> <chr> <chr>
1 1 0 00 Northeast Region
2 1 1 00 New England Division
3 1 1 09 Connecticut
4 1 1 23 Maine
5 1 1 25 Massachusetts
6 1 1 33 New Hampshire
We only need the last two columns, but we might want to rename them. The Name variable is vague. The variable with the FIPS code is called State\n(FIPS). To get rid of the new line in this variable name and to change the Name variable to something more informative, we will use the rename() function of the dplyr package. To use this function, we need to list the new name first followed by = and then the existing variable. We can rename multiple variables at the same time by using a comma to separate the variables we are renaming. We will use the select() function also of the dplyr package just to keep these variables, and we will filter out the rows with FIPS values of 00 with the filter() function, agian also part of the dplyr package. we will specify that we want STATEFP values that are not equal to 00 by using this operator: !=. We will also use the double pipe operator %<>% of the magrittr package which allows us to use data as iuput and then reassign it after we peform sum functions using it.
STATE_FIPS %<>%
dplyr::rename( STATEFP = `State\n(FIPS)`,
STATE = Name) %>%
dplyr::select(STATEFP, STATE) %>%
dplyr::filter(STATEFP != "00")
STATE_FIPS# A tibble: 51 x 2
STATEFP STATE
<chr> <chr>
1 09 Connecticut
2 23 Maine
3 25 Massachusetts
4 33 New Hampshire
5 44 Rhode Island
6 50 Vermont
7 34 New Jersey
8 36 New York
9 42 Pennsylvania
10 17 Illinois
# … with 41 more rows
Click here to see detailed information about how the demogrphic data was wrangled
1977-1979
Now let’s take a look at our demographic data across the decades that we wish to study. If you have very wide data (meaning it has many columns), one way to view the data so that you can see all of the columns at the same time is to use the glimpse() function of the dplyr package.
Taking a look at the first decade of data, we can see that the Race/Sex Indicator contains two types of data, the race and the sex. This does not follow the tidy data philosophy, where each cell of a tibble should only contain one piece of information. Typically one might think of using the separate() function of the tidyr package to split this variable into two. However, one of the race values is Other races and since this also has a space, this makes separating this data more tricky.
Instead we will use the str_extract() function of the stringr package and the mutate() function of the dplyr package. The “mutate()” will allow us to create new variables, and “str_extract()” function will allow us to match specific patterns and pull out matches to those patterns. Therefore, if the Race/Sex Indicator value is Other races male and if we extract patterns matching either "male" or "female" which we can specify like this pattern = "male|female" then, the value will be male.
First we need to rename the Race/Sex Indicator varaible to not have spaces so that it is compatible with the str_extract() function.
We also want to rename a couple of variables to be simpler and filter the data to only include the years of the data we are interested in, as well as remove some variables that we dont need like the FIPS State Code. We can remove variables by using the select() function with a - minus sign in front of the variable we wish to remove.
Rows: 3,060
Columns: 22
$ `Year of Estimate` <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, …
$ `FIPS State Code` <chr> "01", "01", "01", "01", "01", "01", "02", "02", …
$ `State Name` <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
$ `Race/Sex Indicator` <chr> "White male", "White female", "Black male", "Bla…
$ `Under 5 years` <dbl> 105856, 100613, 47403, 47079, 244, 250, 12382, 1…
$ `5 to 9 years` <dbl> 120876, 115194, 55443, 54851, 255, 251, 13888, 1…
$ `10 to 14 years` <dbl> 129091, 122352, 60427, 60065, 253, 245, 13255, 1…
$ `15 to 19 years` <dbl> 119500, 116107, 52921, 55144, 281, 254, 11179, 9…
$ `20 to 24 years` <dbl> 103665, 108513, 29948, 35165, 413, 331, 20237, 1…
$ `25 to 29 years` <dbl> 86538, 88359, 19535, 23662, 239, 302, 12538, 107…
$ `30 to 34 years` <dbl> 74452, 77595, 17196, 22021, 236, 284, 10331, 865…
$ `35 to 39 years` <dbl> 71511, 74941, 16654, 22248, 161, 279, 9548, 7510…
$ `40 to 44 years` <dbl> 75242, 78908, 17564, 24249, 127, 253, 8282, 6353…
$ `45 to 49 years` <dbl> 73874, 78589, 18186, 23028, 108, 148, 6995, 5820…
$ `50 to 54 years` <dbl> 68048, 72481, 17618, 22104, 95, 100, 5609, 4494,…
$ `55 to 59 years` <dbl> 61071, 67699, 18118, 21909, 88, 93, 4029, 2986, …
$ `60 to 64 years` <dbl> 52361, 61065, 16456, 20068, 69, 94, 2392, 1830, …
$ `65 to 69 years` <dbl> 38977, 49685, 14498, 19364, 54, 73, 1292, 965, 2…
$ `70 to 74 years` <dbl> 26767, 37227, 9541, 12509, 70, 66, 602, 496, 8, …
$ `75 to 79 years` <dbl> 17504, 27163, 6030, 8291, 31, 52, 326, 305, 1, 5…
$ `80 to 84 years` <dbl> 9937, 16470, 3485, 5031, 37, 30, 211, 186, 4, 5,…
$ `85 years and over` <dbl> 5616, 10445, 2448, 4035, 76, 29, 143, 126, 19, 4…
dem_77_79 <- dem_77_79 %>%
rename("race_sex" =`Race/Sex Indicator`) %>%
mutate(SEX = str_extract(race_sex, "male|female"),
RACE = str_extract(race_sex, "Black|White|Other"))%>%
select(-`FIPS State Code`, -`race_sex`) %>%
rename("YEAR" = `Year of Estimate`,
"STATE" = `State Name`) %>%
filter(YEAR %in% 1977:1979)
glimpse(dem_77_79)Rows: 918
Columns: 22
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ `Under 5 years` <dbl> 98814, 94595, 46201, 45784, 590, 621, 14316, 1353…
$ `5 to 9 years` <dbl> 113365, 107395, 50097, 49329, 672, 660, 14621, 13…
$ `10 to 14 years` <dbl> 123107, 116182, 54925, 53955, 677, 653, 14795, 13…
$ `15 to 19 years` <dbl> 135343, 130433, 58468, 59926, 674, 605, 15207, 13…
$ `20 to 24 years` <dbl> 126053, 125352, 43898, 51433, 722, 773, 20106, 16…
$ `25 to 29 years` <dbl> 111547, 112471, 31014, 36648, 638, 835, 20444, 18…
$ `30 to 34 years` <dbl> 100674, 101543, 22528, 26694, 571, 766, 17514, 15…
$ `35 to 39 years` <dbl> 81038, 83369, 17473, 22213, 498, 586, 13098, 1069…
$ `40 to 44 years` <dbl> 75042, 77793, 16446, 22146, 356, 479, 10067, 7935…
$ `45 to 49 years` <dbl> 76296, 79753, 16578, 22576, 295, 432, 8460, 6848,…
$ `50 to 54 years` <dbl> 74844, 81079, 17117, 23028, 206, 326, 7268, 5914,…
$ `55 to 59 years` <dbl> 67785, 75905, 16437, 21435, 166, 213, 5398, 4485,…
$ `60 to 64 years` <dbl> 58853, 69406, 16276, 21075, 145, 174, 3349, 2708,…
$ `65 to 69 years` <dbl> 48848, 62430, 15837, 21126, 107, 173, 1714, 1468,…
$ `70 to 74 years` <dbl> 34475, 50075, 11450, 16028, 90, 138, 915, 928, 22…
$ `75 to 79 years` <dbl> 20977, 34027, 7601, 10825, 53, 106, 500, 493, 10,…
$ `80 to 84 years` <dbl> 10831, 21483, 3896, 6272, 25, 49, 237, 268, 4, 7,…
$ `85 years and over` <dbl> 6683, 15729, 2667, 5426, 33, 41, 153, 211, 11, 6,…
$ SEX <chr> "male", "female", "male", "female", "male", "fema…
$ RACE <chr> "White", "White", "Black", "Black", "Other", "Oth…
That’s looking pretty good! We also want to take all the age group variabels and make one variable that is the age group name and one that is the value of the population count for that age group. To do this we will use the pivot_longer() function of the tidyr package. To use this function, we need to use the cols argument to indicate which columns we want to pivot. We also name the new variables we will create with the names_to and values_to arguments. The names_to will be the name of the variable that will identify each age group and values_to will be the name of the variable that contains the corresponding population values.
dem_77_79 <- dem_77_79 %>%
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP")
glimpse(dem_77_79)Rows: 16,524
Columns: 6
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977,…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
$ SEX <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
$ RACE <chr> "White", "White", "White", "White", "White", "White", "Whit…
$ AGE_GROUP <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 to 1…
$ SUB_POP <dbl> 98814, 113365, 123107, 135343, 126053, 111547, 100674, 8103…
We also want to get data about the total population for the state for each year.
To do so we can sum all the values for the SUB_POP variable that we just created. To do this we can use the group_by and summarise() functions of the dplyr package. The group_by() function specifies how we want to calculate our sum, that we would like to calculate it for each year and each state individually. Thus, all the values that have the same STATE and YEAR values will be summed together, rather than summing using all of the values in the SUB_POP variable. The .groups argument allows us to remove the grouping after we peform the calculation with summarise().
pop_77_79 <- dem_77_79 %>%
group_by(YEAR, STATE) %>%
summarise("TOT_POP" = sum(SUB_POP), .groups = "drop")
pop_77_79 # A tibble: 153 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
7 1977 Connecticut 3088745
8 1977 Delaware 594815
9 1977 District of Columbia 681766
10 1977 Florida 8888806
# … with 143 more rows
Now we will add the population value to the demographic tibble using the left_join() function of the dplyr package. It is imporant that we specify how this should be done, that the YEAR and STATE variable vlaues should match eachother. This will place the dem_77_79 variables to the left of the pop_77_79 data.
# A tibble: 16,524 x 7
YEAR STATE SEX RACE AGE_GROUP SUB_POP TOT_POP
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1977 Alabama male White Under 5 years 98814 3782571
2 1977 Alabama male White 5 to 9 years 113365 3782571
3 1977 Alabama male White 10 to 14 years 123107 3782571
4 1977 Alabama male White 15 to 19 years 135343 3782571
5 1977 Alabama male White 20 to 24 years 126053 3782571
6 1977 Alabama male White 25 to 29 years 111547 3782571
7 1977 Alabama male White 30 to 34 years 100674 3782571
8 1977 Alabama male White 35 to 39 years 81038 3782571
9 1977 Alabama male White 40 to 44 years 75042 3782571
10 1977 Alabama male White 45 to 49 years 76296 3782571
# … with 16,514 more rows
We will also calculate the percentage that each group makes up of the total population, by dividing the SUB_POP by the TOT_POP and multiplying by 100 using the mutate() function. we will also remove the other population variables.
dem_77_79 %<>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
select(-SUB_POP, -TOT_POP)
dem_77_79# A tibble: 16,524 x 6
YEAR STATE SEX RACE AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1977 Alabama male White Under 5 years 2.61
2 1977 Alabama male White 5 to 9 years 3.00
3 1977 Alabama male White 10 to 14 years 3.25
4 1977 Alabama male White 15 to 19 years 3.58
5 1977 Alabama male White 20 to 24 years 3.33
6 1977 Alabama male White 25 to 29 years 2.95
7 1977 Alabama male White 30 to 34 years 2.66
8 1977 Alabama male White 35 to 39 years 2.14
9 1977 Alabama male White 40 to 44 years 1.98
10 1977 Alabama male White 45 to 49 years 2.02
# … with 16,514 more rows
It is important to make sure that we have the total values we would expect. We have two levels of SEX, three levels of Race, three levels of YEAR, eighteen levels of AGE_GROUP, and fifty one levels of STATE. If we multiply this together we get 16,524 which is the same as the number of rows in our final dem_77_79 data. Looks good!
Also Let’s make the values of the SEX variable capatalized so that they match the other values of the other variables like RACE etc. This will help us to keep consistent values across the different years as we wrangle the data for the other decades. To do so we will use the str_to_title() function of the stringr package. We need to use the pull() function to get the values of SEX out of dem_77_79. Once we make them captialized they are then reasigned to the SEX variable.
dem_77_79 %<>%
mutate(SEX = str_to_title(pull(dem_77_79, SEX)))
# This can also be done line this:
dem_77_79 %<>%
mutate(SEX = str_to_title(pull(., SEX)))1980-1989
For this decade each year is a separate tibble and they are combined as a list.
[1] "list"
So the first thing we need to do is combine each tibble of the list together. We can do that using the bind_rows() function of dplyr which appends the data together based on the presence of columns with the same name in the different tibbles. We will use the map_df() function of the purrr package to allow us to do this across each tibble in our list.
Rows: 188,460
Columns: 21
$ `Year of Estimate` <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ `FIPS State and County Codes` <chr> "01001", "01001", "01001", "01001", "01…
$ `Race/Sex Indicator` <chr> "White male", "White female", "Black ma…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 6…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 2…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 1…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 1…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 1…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 16…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 1…
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65,…
Great! Now our data is all together.
Now we will wrangle the data similarly to the previous decade.
dem_80_89 <- dem_80_89 %>%
rename("race_sex" =`Race/Sex Indicator`) %>%
mutate(SEX = str_extract(race_sex, "male|female"),
RACE = str_extract(race_sex, "Black|White|Other"))%>%
select( -`race_sex`) %>%
rename("YEAR" = `Year of Estimate`)
glimpse(dem_80_89)Rows: 188,460
Columns: 22
$ YEAR <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ `FIPS State and County Codes` <chr> "01001", "01001", "01001", "01001", "01…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 6…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 2…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 1…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 1…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 1…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 16…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 1…
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65,…
$ SEX <chr> "male", "female", "male", "female", "ma…
$ RACE <chr> "White", "White", "Black", "Black", "Ot…
Notice that this time the state information is based on the numeric FIPS value. We want only the first two values, as the rest indicate the county. We can use the str_sub() function of the stringr package for this. We will specify that we want to start at the first position and end at the second. Just like str_extract() we need to rename this variable first so that it is compatible.
dem_80_89 %<>%
rename("STATEFP_temp" = "FIPS State and County Codes") %>%
mutate(STATEFP = str_sub(STATEFP_temp, start = 1, end = 2)) %>%
left_join(STATE_FIPS, by = "STATEFP") %>%
dplyr::select(-STATEFP)
glimpse(dem_80_89)Rows: 188,460
Columns: 23
$ YEAR <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1…
$ STATEFP_temp <chr> "01001", "01001", "01001", "01001", "01001", "010…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 672, 645, 3…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, 740, 680, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614, 644, 670…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841, 711, 762…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, 516, 601, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, 414, 469, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, 303, 352, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202, 224, 260,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, 206, 288, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, 219, 236, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 203, 261, 7…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 178, 219, 8…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 171, 209, 8…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 170, 232, 6…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 164, 182, 4,…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 129, 3, 6, …
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67, 1, 2, 56…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65, 1, 1, 30,…
$ SEX <chr> "male", "female", "male", "female", "male", "fema…
$ RACE <chr> "White", "White", "Black", "Black", "Other", "Oth…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
dem_80_89 %<>%
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP_temp") %>%
group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) %>%
summarise(SUB_POP = sum(SUB_POP_temp), .groups="drop")
dem_80_89# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 50108
2 1980 Alabama 10 to 14 years female Other 805
3 1980 Alabama 10 to 14 years female White 109066
4 1980 Alabama 10 to 14 years male Black 50768
5 1980 Alabama 10 to 14 years male Other 826
6 1980 Alabama 10 to 14 years male White 115988
7 1980 Alabama 15 to 19 years female Black 58428
8 1980 Alabama 15 to 19 years female Other 743
9 1980 Alabama 15 to 19 years female White 126783
10 1980 Alabama 15 to 19 years male Black 56808
# … with 55,070 more rows
pop_80_89 <- dem_80_89 %>%
group_by(YEAR, STATE) %>%
summarise("TOT_POP" = sum(SUB_POP), .groups = "drop")
dem_80_89 <- dem_80_89 %>%
left_join(pop_80_89, by = c("YEAR","STATE")) %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
dplyr::select(-SUB_POP, -TOT_POP)
dem_80_89# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 1.28
2 1980 Alabama 10 to 14 years female Other 0.0206
3 1980 Alabama 10 to 14 years female White 2.80
4 1980 Alabama 10 to 14 years male Black 1.30
5 1980 Alabama 10 to 14 years male Other 0.0212
6 1980 Alabama 10 to 14 years male White 2.97
7 1980 Alabama 15 to 19 years female Black 1.50
8 1980 Alabama 15 to 19 years female Other 0.0191
9 1980 Alabama 15 to 19 years female White 3.25
10 1980 Alabama 15 to 19 years male Black 1.46
# … with 55,070 more rows
Just like with the data from the 70s we will also change the values for SEX to be capitalized.
Again, it is important to make sure that we have the total values we would expect. This time we have: two levels of SEX, three levels of Race, ten levels of YEAR, eighteen levels of AGE_GROUP, and fifty one levels of STATE.
If we multiply these together we get 55,080, which is the same as the number of rows of the final dem_80_89 data. Looks good!
1990-1999
Just like the 80s we need to combine the data across the files:
Rows: 43,870
Columns: 19
$ Year <dbl> NA, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 19…
$ e <chr> NA, "01", "01", "01", "01", "01", "01", "01", "01", "01", "0…
$ Age <dbl> NA, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ Male <dbl> NA, 20406, 19393, 18990, 19246, 19502, 19560, 19091, 19605, …
$ Female <dbl> NA, 19101, 18114, 18043, 17786, 18366, 18386, 18047, 18316, …
$ Male_1 <dbl> NA, 9794, 9475, 9097, 9002, 9076, 9169, 8919, 9219, 9247, 10…
$ Female_1 <dbl> NA, 9414, 9247, 8837, 8701, 8989, 9093, 8736, 9192, 9108, 97…
$ Male_2 <dbl> NA, 103, 87, 97, 94, 108, 128, 160, 178, 166, 205, 194, 179,…
$ Female_2 <dbl> NA, 90, 93, 100, 115, 114, 130, 134, 162, 155, 193, 185, 202…
$ Male_3 <dbl> NA, 192, 146, 175, 150, 168, 170, 183, 171, 136, 177, 169, 1…
$ Female_3 <dbl> NA, 170, 182, 160, 157, 178, 158, 173, 177, 185, 179, 171, 1…
$ Male_4 <dbl> NA, 223, 190, 198, 186, 190, 210, 188, 178, 182, 221, 194, 1…
$ Female_4 <dbl> NA, 220, 196, 173, 191, 190, 170, 172, 179, 173, 166, 175, 1…
$ Male_5 <dbl> NA, 47, 41, 32, 35, 36, 30, 28, 27, 29, 32, 31, 33, 34, 32, …
$ Female_5 <dbl> NA, 45, 47, 41, 30, 26, 37, 23, 35, 31, 28, 38, 22, 39, 29, …
$ Male_6 <dbl> NA, 1, 2, 1, 9, 5, 8, 2, 4, 6, 6, 0, 1, 9, 6, 7, 5, 2, 2, 4,…
$ Female_6 <dbl> NA, 8, 0, 2, 1, 4, 5, 3, 4, 4, 3, 4, 2, 2, 7, 0, 2, 2, 1, 6,…
$ Male_7 <dbl> NA, 5, 7, 2, 3, 5, 11, 2, 7, 12, 10, 7, 5, 6, 5, 6, 6, 2, 11…
$ Female_7 <dbl> NA, 5, 5, 5, 3, 14, 6, 7, 6, 3, 11, 5, 5, 7, 8, 6, 6, 7, 3, …
For this decade the column names can’t all be imported in a simple way from the table, so they need to be recoded.
Here is what the data looks like before importing:
So, first using the base colnames() function we change the names of the column names.
colnames(dem_90_99) <- c("YEAR",
"STATEFP",
"Age",
"NH_W_M",
"NH_W_F",
"NH_B_M",
"NH_B_F",
"NH_AIAN_M",
"NH_AIAN_F",
"NH_API_M",
"NH_API_F",
"H_W_M",
"H_W_F",
"H_B_M",
"H_B_F",
"H_AIAN_M",
"H_AIAN_F",
"H_API_M",
"H_API_F")
glimpse(dem_90_99)Rows: 43,870
Columns: 19
$ YEAR <dbl> NA, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1…
$ STATEFP <chr> NA, "01", "01", "01", "01", "01", "01", "01", "01", "01", "…
$ Age <dbl> NA, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ NH_W_M <dbl> NA, 20406, 19393, 18990, 19246, 19502, 19560, 19091, 19605,…
$ NH_W_F <dbl> NA, 19101, 18114, 18043, 17786, 18366, 18386, 18047, 18316,…
$ NH_B_M <dbl> NA, 9794, 9475, 9097, 9002, 9076, 9169, 8919, 9219, 9247, 1…
$ NH_B_F <dbl> NA, 9414, 9247, 8837, 8701, 8989, 9093, 8736, 9192, 9108, 9…
$ NH_AIAN_M <dbl> NA, 103, 87, 97, 94, 108, 128, 160, 178, 166, 205, 194, 179…
$ NH_AIAN_F <dbl> NA, 90, 93, 100, 115, 114, 130, 134, 162, 155, 193, 185, 20…
$ NH_API_M <dbl> NA, 192, 146, 175, 150, 168, 170, 183, 171, 136, 177, 169, …
$ NH_API_F <dbl> NA, 170, 182, 160, 157, 178, 158, 173, 177, 185, 179, 171, …
$ H_W_M <dbl> NA, 223, 190, 198, 186, 190, 210, 188, 178, 182, 221, 194, …
$ H_W_F <dbl> NA, 220, 196, 173, 191, 190, 170, 172, 179, 173, 166, 175, …
$ H_B_M <dbl> NA, 47, 41, 32, 35, 36, 30, 28, 27, 29, 32, 31, 33, 34, 32,…
$ H_B_F <dbl> NA, 45, 47, 41, 30, 26, 37, 23, 35, 31, 28, 38, 22, 39, 29,…
$ H_AIAN_M <dbl> NA, 1, 2, 1, 9, 5, 8, 2, 4, 6, 6, 0, 1, 9, 6, 7, 5, 2, 2, 4…
$ H_AIAN_F <dbl> NA, 8, 0, 2, 1, 4, 5, 3, 4, 4, 3, 4, 2, 2, 7, 0, 2, 2, 1, 6…
$ H_API_M <dbl> NA, 5, 7, 2, 3, 5, 11, 2, 7, 12, 10, 7, 5, 6, 5, 6, 6, 2, 1…
$ H_API_F <dbl> NA, 5, 5, 5, 3, 14, 6, 7, 6, 3, 11, 5, 5, 7, 8, 6, 6, 7, 3,…
Notice also that the first row is all NA values from white space in the orginal table for 1990, this is probably true for each year. We can check them dimensions of our table using the base dim() function. When we filter for rows where YEAR is NA, we indeed see 10 rows, which is what we would expect if we have a row like this for each of the years in the decade. We see the same if we try a different variable. Now we will test to see how large our tibble is if we drop rows with NA values using the drop_na() function of tidyr. We that indeed our dimensions only changed by ten, so there are not other rows with missing values that we might not expect. So now we will resign the dem_90_99 variable after removing these rows.
[1] 43870 19
# A tibble: 10 x 19
YEAR STATEFP Age NH_W_M NH_W_F NH_B_M NH_B_F NH_AIAN_M NH_AIAN_F NH_API_M
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA <NA> NA NA NA NA NA NA NA NA
2 NA <NA> NA NA NA NA NA NA NA NA
3 NA <NA> NA NA NA NA NA NA NA NA
4 NA <NA> NA NA NA NA NA NA NA NA
5 NA <NA> NA NA NA NA NA NA NA NA
6 NA <NA> NA NA NA NA NA NA NA NA
7 NA <NA> NA NA NA NA NA NA NA NA
8 NA <NA> NA NA NA NA NA NA NA NA
9 NA <NA> NA NA NA NA NA NA NA NA
10 NA <NA> NA NA NA NA NA NA NA NA
# … with 9 more variables: NH_API_F <dbl>, H_W_M <dbl>, H_W_F <dbl>,
# H_B_M <dbl>, H_B_F <dbl>, H_AIAN_M <dbl>, H_AIAN_F <dbl>, H_API_M <dbl>,
# H_API_F <dbl>
# A tibble: 10 x 19
YEAR STATEFP Age NH_W_M NH_W_F NH_B_M NH_B_F NH_AIAN_M NH_AIAN_F NH_API_M
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA <NA> NA NA NA NA NA NA NA NA
2 NA <NA> NA NA NA NA NA NA NA NA
3 NA <NA> NA NA NA NA NA NA NA NA
4 NA <NA> NA NA NA NA NA NA NA NA
5 NA <NA> NA NA NA NA NA NA NA NA
6 NA <NA> NA NA NA NA NA NA NA NA
7 NA <NA> NA NA NA NA NA NA NA NA
8 NA <NA> NA NA NA NA NA NA NA NA
9 NA <NA> NA NA NA NA NA NA NA NA
10 NA <NA> NA NA NA NA NA NA NA NA
# … with 9 more variables: NH_API_F <dbl>, H_W_M <dbl>, H_W_F <dbl>,
# H_B_M <dbl>, H_B_F <dbl>, H_AIAN_M <dbl>, H_AIAN_F <dbl>, H_API_M <dbl>,
# H_API_F <dbl>
# A tibble: 43,860 x 19
YEAR STATEFP Age NH_W_M NH_W_F NH_B_M NH_B_F NH_AIAN_M NH_AIAN_F NH_API_M
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1990 01 0 20406 19101 9794 9414 103 90 192
2 1990 01 1 19393 18114 9475 9247 87 93 146
3 1990 01 2 18990 18043 9097 8837 97 100 175
4 1990 01 3 19246 17786 9002 8701 94 115 150
5 1990 01 4 19502 18366 9076 8989 108 114 168
6 1990 01 5 19560 18386 9169 9093 128 130 170
7 1990 01 6 19091 18047 8919 8736 160 134 183
8 1990 01 7 19605 18316 9219 9192 178 162 171
9 1990 01 8 18823 17743 9247 9108 166 155 136
10 1990 01 9 20226 19178 10194 9784 205 193 177
# … with 43,850 more rows, and 9 more variables: NH_API_F <dbl>, H_W_M <dbl>,
# H_W_F <dbl>, H_B_M <dbl>, H_B_F <dbl>, H_AIAN_M <dbl>, H_AIAN_F <dbl>,
# H_API_M <dbl>, H_API_F <dbl>
Then we sum across the nonhispanic and hispaninc groups because this information is not available for the other previous decades. Then we will remove the variables for the hispanic and nonhispanic subgroups using select().
dem_90_99%<>%
mutate(W_M = NH_W_M + H_W_M,
W_F = NH_W_F + H_W_F,
B_M = NH_B_M + H_B_M,
B_F = NH_B_F + H_B_F,
AIAN_M = NH_AIAN_M + H_AIAN_M,
AIAN_F = NH_AIAN_F + H_AIAN_F,
API_M = NH_API_M + H_API_M,
API_F = NH_API_F + H_API_F) %>%
select(-starts_with("NH_"), -starts_with("H_"))
glimpse(dem_90_99)Rows: 43,860
Columns: 11
$ YEAR <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1…
$ STATEFP <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "…
$ Age <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ W_M <dbl> 20629, 19583, 19188, 19432, 19692, 19770, 19279, 19783, 19005…
$ W_F <dbl> 19321, 18310, 18216, 17977, 18556, 18556, 18219, 18495, 17916…
$ B_M <dbl> 9841, 9516, 9129, 9037, 9112, 9199, 8947, 9246, 9276, 10226, …
$ B_F <dbl> 9459, 9294, 8878, 8731, 9015, 9130, 8759, 9227, 9139, 9812, 1…
$ AIAN_M <dbl> 104, 89, 98, 103, 113, 136, 162, 182, 172, 211, 194, 180, 209…
$ AIAN_F <dbl> 98, 93, 102, 116, 118, 135, 137, 166, 159, 196, 189, 204, 198…
$ API_M <dbl> 197, 153, 177, 153, 173, 181, 185, 178, 148, 187, 176, 164, 1…
$ API_F <dbl> 175, 187, 165, 160, 192, 164, 180, 183, 188, 190, 176, 168, 1…
Looking better! We also need to add age groups like the other decades. We will take a look at the 80s data using the distinct() function of the dplyr package to see what age groups we need. We can use the base cut() function to create a new variable with mutate() called AGE_GROUP that will have a label for every change in 5 years of age. The right = FALSE argument specifies that the interval is not closed on the right, meaning that if the value is at the cutpoint like the Age value is 5, then it will be in the 5 to 9 years group.
We can make the labels for the AGE_GROUP variable match those of dem_77_79 but we need to pull out the values of the tibble created by distinct(). To do this we can use the pull() function from the dplyr package. Note that it is important to check that the AGE_GROUP values are listed in order for dem_77_79. We will also remove the Age variable after we create the new AGE_GROUP variable for the dem_90_99 data.
# A tibble: 18 x 1
AGE_GROUP
<chr>
1 Under 5 years
2 5 to 9 years
3 10 to 14 years
4 15 to 19 years
5 20 to 24 years
6 25 to 29 years
7 30 to 34 years
8 35 to 39 years
9 40 to 44 years
10 45 to 49 years
11 50 to 54 years
12 55 to 59 years
13 60 to 64 years
14 65 to 69 years
15 70 to 74 years
16 75 to 79 years
17 80 to 84 years
18 85 years and over
[1] "Under 5 years" "5 to 9 years" "10 to 14 years"
[4] "15 to 19 years" "20 to 24 years" "25 to 29 years"
[7] "30 to 34 years" "35 to 39 years" "40 to 44 years"
[10] "45 to 49 years" "50 to 54 years" "55 to 59 years"
[13] "60 to 64 years" "65 to 69 years" "70 to 74 years"
[16] "75 to 79 years" "80 to 84 years" "85 years and over"
dem_90_99 %<>%
mutate(AGE_GROUP = cut(Age,
breaks = seq(0,90, by=5),
right = FALSE, labels = pull(distinct(dem_77_79,AGE_GROUP), AGE_GROUP))) %>%
select(-Age)
glimpse(dem_90_99)Rows: 43,860
Columns: 11
$ YEAR <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,…
$ STATEFP <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",…
$ W_M <dbl> 20629, 19583, 19188, 19432, 19692, 19770, 19279, 19783, 190…
$ W_F <dbl> 19321, 18310, 18216, 17977, 18556, 18556, 18219, 18495, 179…
$ B_M <dbl> 9841, 9516, 9129, 9037, 9112, 9199, 8947, 9246, 9276, 10226…
$ B_F <dbl> 9459, 9294, 8878, 8731, 9015, 9130, 8759, 9227, 9139, 9812,…
$ AIAN_M <dbl> 104, 89, 98, 103, 113, 136, 162, 182, 172, 211, 194, 180, 2…
$ AIAN_F <dbl> 98, 93, 102, 116, 118, 135, 137, 166, 159, 196, 189, 204, 1…
$ API_M <dbl> 197, 153, 177, 153, 173, 181, 185, 178, 148, 187, 176, 164,…
$ API_F <dbl> 175, 187, 165, 160, 192, 164, 180, 183, 188, 190, 176, 168,…
$ AGE_GROUP <fct> Under 5 years, Under 5 years, Under 5 years, Under 5 years,…
Like the previous decades we will create a RACE and SUB_POP variable using pivot_longer() to create a single Race variable out of all the subgroup variables.
Now we need to collapse the data for the various races so that it matches the previous decades. This time we will use the case_when() function of the dplyr package and the str_detect() function of the stringr package to identify when the race is something other than B or W and replace with the value Other. The value to the right of the ~ indicates what we want the value of the new variable to be if the value of the variable we are using with str_decect() matches the condition specified. If the value does not match the specified condition, than the other values will be what ever is listed after TRUE ~. We will then create population counts as we did previously for the other decades.
Finally, we will create new sums for the subpopulations where we sum across the two Other subgroups Race to a create a single value for each value of YEAR, SEX, AGE_GROUP, and STATE by using the group_by() function and summarie().
dem_90_99 %<>%
pivot_longer(cols = c(starts_with("W_"),
starts_with("B_"),
starts_with("AIAN_"),
starts_with("API_")),
names_to = "RACE",
values_to = "SUB_POP_temp")
dem_90_99 %<>%
mutate(SEX = case_when(str_detect(RACE, "_M") ~ "Male",
TRUE ~ "Female"),
RACE = case_when(str_detect(RACE, "W_") ~ "White",
str_detect(RACE, "B_") ~ "Black",
TRUE ~ "Other")) %>%
left_join(STATE_FIPS, by = "STATEFP") %>%
dplyr::select(-STATEFP)
dem_90_99 %<>%
group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) %>%
summarise(SUB_POP = sum(SUB_POP_temp), .groups="drop")pop_90_99 <- dem_90_99 %>%
group_by(YEAR, STATE) %>%
summarise(TOT_POP = sum(SUB_POP), .groups = "drop")
dem_90_99 <- dem_90_99 %>%
left_join(pop_90_99, by=c("YEAR", "STATE")) %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
dplyr::select(-SUB_POP, -TOT_POP)
dem_90_99# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <fct> <chr> <chr> <dbl>
1 1990 Alabama Under 5 years Female Black 1.12
2 1990 Alabama Under 5 years Female Other 0.0347
3 1990 Alabama Under 5 years Female White 2.28
4 1990 Alabama Under 5 years Male Black 1.15
5 1990 Alabama Under 5 years Male Other 0.0336
6 1990 Alabama Under 5 years Male White 2.43
7 1990 Alabama 5 to 9 years Female Black 1.14
8 1990 Alabama 5 to 9 years Female Other 0.0419
9 1990 Alabama 5 to 9 years Female White 2.29
10 1990 Alabama 5 to 9 years Male Black 1.16
# … with 55,070 more rows
Again, we should check to make sure that we have the total values we would expect. We have the same number of unique values for each of our variables as in with the data from the 80s, so if we collpased the data for the different additional subpopulations in this data, then we have done it correctly.
Indeed it looks like we have 55,080 rows, which is what we would expect and is the same as the number of rows of the final dem_80_89 data. Looks good!
2000-2010
Again, for this decade we need to combine the data across years.
Rows: 62,244
Columns: 21
$ REGION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ DIVISION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ STATE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ NAME <chr> "United States", "United States", "United States", …
$ SEX <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ORIGIN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ RACE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ AGEGRP <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ ESTIMATESBASE2000 <dbl> 281424600, 19176154, 20549855, 20528425, 20218782, …
$ POPESTIMATE2000 <dbl> 282162411, 19178293, 20463852, 20637696, 20294955, …
$ POPESTIMATE2001 <dbl> 284968955, 19298217, 20173362, 20978678, 20456284, …
$ POPESTIMATE2002 <dbl> 287625193, 19429192, 19872417, 21261421, 20610370, …
$ POPESTIMATE2003 <dbl> 290107933, 19592446, 19620851, 21415353, 20797166, …
$ POPESTIMATE2004 <dbl> 292805298, 19785885, 19454237, 21411680, 21102552, …
$ POPESTIMATE2005 <dbl> 295516599, 19917400, 19389067, 21212579, 21486214, …
$ POPESTIMATE2006 <dbl> 298379912, 19938883, 19544688, 21033138, 21807709, …
$ POPESTIMATE2007 <dbl> 301231207, 20125962, 19714611, 20841042, 22067816, …
$ POPESTIMATE2008 <dbl> 304093966, 20271127, 19929602, 20706655, 22210880, …
$ POPESTIMATE2009 <dbl> 306771529, 20244518, 20182499, 20660564, 22192810, …
$ CENSUS2010POP <dbl> 308745538, 20201362, 20348657, 20677194, 22040343, …
$ POPESTIMATE2010 <dbl> 309349689, 20200529, 20382409, 20694011, 21959087, …
Ok, the data looks a bit different from the others. First we will remove a couple of variables that we probably don’t need. Also it looks like we have some values for the entire United Sates and we will drop these to be like the other decades.
We can see that there are lots of values that are zero. According to the technical documentation for this data, zero values indicate the total for the other categories of Sex, Origin, Race, and AGEGRP.
So we will drop the total values for SEX, RACE, and AGEGRP by removing the rows where these variables are equal to zero.
We will also want to only select for the total values for Origin as we do not wish to divide the data into subgroups about hispanic ethnicity because we do not have that information for the first two decades. Thus we will filter for only the rows where Origin is equal to zero.
We will also then remove the REGION, Division, STATE, and Origin variables. We will then rename NAME to be STATE and rename AGEGRP to be like the other decades as AGE_GROUP.
dem_00_10 %<>%
filter(SEX != 0,
RACE != 0,
AGEGRP != 0,
ORIGIN == 0) %>%
dplyr::select(-REGION, -DIVISION, -ORIGIN, -STATE) %>%
rename("STATE" = NAME,
"AGE_GROUP" = AGEGRP)
dem_00_10# A tibble: 11,016 x 15
STATE SEX RACE AGE_GROUP POPESTIMATE2000 POPESTIMATE2001 POPESTIMATE2002
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alab… 1 1 1 99527 99985 99578
2 Alab… 1 1 2 104423 102518 101023
3 Alab… 1 1 3 108325 108412 108059
4 Alab… 1 1 4 108638 107370 107337
5 Alab… 1 1 5 104337 107230 108195
6 Alab… 1 1 6 106491 101466 98949
7 Alab… 1 1 7 110116 110630 110416
8 Alab… 1 1 8 123719 120283 116502
9 Alab… 1 1 9 124961 125443 124751
10 Alab… 1 1 10 115024 117010 119354
# … with 11,006 more rows, and 8 more variables: POPESTIMATE2003 <dbl>,
# POPESTIMATE2004 <dbl>, POPESTIMATE2005 <dbl>, POPESTIMATE2006 <dbl>,
# POPESTIMATE2007 <dbl>, POPESTIMATE2008 <dbl>, POPESTIMATE2009 <dbl>,
# POPESTIMATE2010 <dbl>
Now we need to recode the numeric values to the values in the techincal documentation. We can do so by adding labels to each numeric level using the base function factor().
dem_00_10 %<>%
mutate(SEX = factor(SEX,
levels = 1:2,
labels = c("Male",
"Female")),
RACE = factor(RACE,
levels = 1:6,
labels = c("White",
"Black",
rep("Other",4))),
AGE_GROUP = factor(AGE_GROUP,
levels = 1:18,
labels = pull(distinct(dem_77_79,AGE_GROUP), AGE_GROUP)))
glimpse(dem_00_10)Rows: 11,016
Columns: 15
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
$ SEX <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male,…
$ RACE <fct> White, White, White, White, White, White, White, Whit…
$ AGE_GROUP <fct> Under 5 years, 5 to 9 years, 10 to 14 years, 15 to 19…
$ POPESTIMATE2000 <dbl> 99527, 104423, 108325, 108638, 104337, 106491, 110116…
$ POPESTIMATE2001 <dbl> 99985, 102518, 108412, 107370, 107230, 101466, 110630…
$ POPESTIMATE2002 <dbl> 99578, 101023, 108059, 107337, 108195, 98949, 110416,…
$ POPESTIMATE2003 <dbl> 99627, 99920, 108026, 107749, 109360, 98276, 109893, …
$ POPESTIMATE2004 <dbl> 99788, 99306, 107627, 108666, 109037, 98742, 107653, …
$ POPESTIMATE2005 <dbl> 100316, 99754, 106570, 110278, 108727, 100327, 105151…
$ POPESTIMATE2006 <dbl> 100820, 101251, 106228, 111640, 108847, 103869, 10161…
$ POPESTIMATE2007 <dbl> 101766, 101985, 106243, 112353, 109496, 105175, 99917…
$ POPESTIMATE2008 <dbl> 102304, 102479, 106155, 113305, 110007, 106348, 99921…
$ POPESTIMATE2009 <dbl> 101411, 102688, 106130, 113741, 111167, 106497, 10138…
$ POPESTIMATE2010 <dbl> 99480, 102939, 106324, 112272, 112423, 106593, 102923…
OK, we also want to change the shape of the data so that we have a YEAR variable and each estimate of the population is a value in a new variable called SUB_POP_temp.
dem_00_10 %<>%
pivot_longer(cols=contains("ESTIMATE"),
names_to = "YEAR",
values_to = "SUB_POP_temp")We will now clean up the YEAR variable to only be the numeric value by keeping only the last 4 values of each string using the str_sub() function of the stringr package.
Now we will collapse the data for the different RACES and calculate a new SUB_POP value.
dem_00_10 %<>%
group_by(YEAR, AGE_GROUP, STATE, SEX, RACE) %>%
summarise(SUB_POP = sum(SUB_POP_temp), .groups = "drop")Agian, the dimensions look as we expect with 60,588 rows. This time we have two levels of SEX, three levels of Race, 11 levels of YEAR, eighteen levels of AGE_GROUP, and fifty one levels of STATE. If we multiply this together we get 16,588. Looks good!
Now we will calculate the total polutation and percent of the total as we have done with the previous decades.
pop_00_10 <- dem_00_10 %>%
group_by(YEAR, STATE) %>%
summarise(TOT_POP = sum(SUB_POP), .groups = "drop")We can also check that our wrangling was performecd correctly by summing the values for the individual subpopulations percentages and seeing if it totals to 100.
dem_00_10 %>%
left_join(pop_00_10, by=c("YEAR", "STATE")) %>%
group_by(YEAR, STATE) %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
summarise(perc_tot = sum(PERC_SUB_POP), .groups = "drop") %>%
mutate(poss_error = case_when(abs(perc_tot - 100) > 0 ~ TRUE,
TRUE ~ FALSE)) %>%
group_by(poss_error) %>%
tally()# A tibble: 1 x 2
poss_error n
<lgl> <int>
1 FALSE 561
Looks like the percentages for each state for each year all add up to 100, as we would expect. Great! Now we will reasign the dem_00_10 data with this processing.
dem_00_10 %<>%
left_join(pop_00_10, by = c("YEAR", "STATE")) %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
select(-SUB_POP, -TOT_POP)
dem_00_10# A tibble: 60,588 x 6
YEAR AGE_GROUP STATE SEX RACE PERC_SUB_POP
<dbl> <fct> <chr> <fct> <fct> <dbl>
1 2000 Under 5 years Alabama Male White 2.24
2 2000 Under 5 years Alabama Male Black 1.05
3 2000 Under 5 years Alabama Male Other 0.101
4 2000 Under 5 years Alabama Female White 2.12
5 2000 Under 5 years Alabama Female Black 1.03
6 2000 Under 5 years Alabama Female Other 0.0995
7 2000 Under 5 years Alaska Male White 2.35
8 2000 Under 5 years Alaska Male Black 0.165
9 2000 Under 5 years Alaska Male Other 1.37
10 2000 Under 5 years Alaska Female White 2.26
# … with 60,578 more rows
OK, now we are ready to combine all of our demgraphic data together!
We can check that the colnames are the same for the data for each of the decades by using the setequal() function of the dplyr package.
[1] TRUE
[1] TRUE
[1] TRUE
We can also confirm that we have the same number of age groups for each decade by using the base length() function. If you did not take a look at the wrangling for the demographic data then you may be unfamiliar with the pull() function of the dplyr package. This allows you to grab the values of a variable from a tibble. The distinct() function which is also of the dplyr package creates a tibble of the unique values for a variable.
[1] 18
[1] 18
[1] 18
[1] 18
Looks good!
Now we will combine the data using the bind_rows() function of the dplyr package. This function appends the data together based on the presence of columns with the same name in the different tibbles.
Rows: 187,272
Columns: 6
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 19…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "…
$ SEX <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", …
$ RACE <chr> "White", "White", "White", "White", "White", "White", "W…
$ AGE_GROUP <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 t…
$ PERC_SUB_POP <dbl> 2.6123502, 2.9970356, 3.2545853, 3.5780690, 3.3324688, 2…
Great! now we have a really large single tibble.
Now we want to select similar demographic data to what was used in the previous analyses.
Here is the table from the Donohue paper that compares the data used in the analyses.
We can see that only the percentage of males that were from age 15-39 of the race groups (black, white, and other) were used in the Donohue analysis.
Ultimately we intend to make a tibble of data that is similar to each analysis. Therefore, we will create a data tibble about the demogrphaic data for each analysis now.
To do so we will first create a vector of the age groups that should be included in the Donohue-like analysis, that we will call DONOHUE_AGE_GROUPS. We will then filter for only the age groups in this vector by using the filter() function of the dplyr package and the %in% operator to indicate that we want to keep all AGE_GROUP values that are equal to those within DONOHUE_AGE_GROUPS. We also want to filter for only population percentages for males by using the == operator. Then we can collpase the age groups from 20-39 by using the fct_collpase() function of the forcats package.
DONOHUE_AGE_GROUPS <- c("15 to 19 years",
"20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years")
dem_DONOHUE <- dem %>%
filter(AGE_GROUP %in% DONOHUE_AGE_GROUPS,
SEX == "Male") %>%
mutate(AGE_GROUP = fct_collapse(AGE_GROUP, "20 to 39 years"=c("20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years")))
dem_DONOHUE# A tibble: 26,010 x 6
YEAR STATE SEX RACE AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <fct> <dbl>
1 1977 Alabama Male White 15 to 19 years 3.58
2 1977 Alabama Male White 20 to 39 years 3.33
3 1977 Alabama Male White 20 to 39 years 2.95
4 1977 Alabama Male White 20 to 39 years 2.66
5 1977 Alabama Male White 20 to 39 years 2.14
6 1977 Alabama Male Black 15 to 19 years 1.55
7 1977 Alabama Male Black 20 to 39 years 1.16
8 1977 Alabama Male Black 20 to 39 years 0.820
9 1977 Alabama Male Black 20 to 39 years 0.596
10 1977 Alabama Male Black 20 to 39 years 0.462
# … with 26,000 more rows
We also want to create a new variable that will contain all the demographic information for each percentage just as was done in the Donohue, et al. analysis. This should result in 6 different demographic variables.
To do this we will modify the AGE_GROUP variable by using the mutate() function of the dplyr package. We will replace the spaces in the now two age group categorise with and undesrscore using the str_replace_all() function of the stringr package which replaces all instances of a pattern in a character string.
Then we will use the group_by() function and the summarise() funtion also of the dplyr package to allow us to calculate a sum of the percentages for each of the subpopulation percentages for the newly modifed age groups in AGE_GROUP. The .groups = "drop" argument allows for the grouping to be removed after the summarise() function.
dem_DONOHUE %<>%
mutate(AGE_GROUP = str_replace_all(string = AGE_GROUP,
pattern = " ",
replacement = "_")) %>%
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) %>%
summarise(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop")
dem_DONOHUE# A tibble: 10,404 x 6
YEAR STATE RACE SEX AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1977 Alabama Black Male 15_to_19_years 1.55
2 1977 Alabama Black Male 20_to_39_years 3.04
3 1977 Alabama Other Male 15_to_19_years 0.0178
4 1977 Alabama Other Male 20_to_39_years 0.0642
5 1977 Alabama White Male 15_to_19_years 3.58
6 1977 Alabama White Male 20_to_39_years 11.1
7 1977 Alaska Black Male 15_to_19_years 0.163
8 1977 Alaska Black Male 20_to_39_years 0.968
9 1977 Alaska Other Male 15_to_19_years 1.12
10 1977 Alaska Other Male 20_to_39_years 2.73
# … with 10,394 more rows
Now we will combine the variables RACE, SEX, and AGE_GROUP together into one string separated by underscores using the unite function of the tidyr package. we will call this new variable VARIABLE. We will rename the PERC_SUB_POP variable to be VALUE using the rename() function of the dplyr package. The new name should be listed first before the =.
dem_DONOHUE %<>%
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") %>%
rename("VALUE" = PERC_SUB_POP)
dem_DONOHUE# A tibble: 10,404 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Black_Male_15_to_19_years 1.55
2 1977 Alabama Black_Male_20_to_39_years 3.04
3 1977 Alabama Other_Male_15_to_19_years 0.0178
4 1977 Alabama Other_Male_20_to_39_years 0.0642
5 1977 Alabama White_Male_15_to_19_years 3.58
6 1977 Alabama White_Male_20_to_39_years 11.1
7 1977 Alaska Black_Male_15_to_19_years 0.163
8 1977 Alaska Black_Male_20_to_39_years 0.968
9 1977 Alaska Other_Male_15_to_19_years 1.12
10 1977 Alaska Other_Male_20_to_39_years 2.73
# … with 10,394 more rows
Let’s do a quick row number check. We have six different demographic variables, 51 states (DC counts as a state in this case), and 34 different years from 1977 to 2010, we should have 10,404 rows, which we do!
Now, let’s do the same for the “Lott-like” analysis.
So, in this analysis there were 36 variables covering percentages of indiviuals from 10 to over 65, three race groups and both males and females. This table is misprinted and does not include the word “Other” for the third race group that was used.
First we will filter out the age groups that were not included. Then we will collapse the age groups to those that were used by Lott et al. again using the fct_collpase() function of the forcats package.
Also we will again combine the values across the variables to create a new demographic varaible with 36 levels.
LOTT_AGE_GROUPS_NULL <- c("Under 5 years",
"5 to 9 years")
dem_LOTT <- dem %>%
filter(!(AGE_GROUP %in% LOTT_AGE_GROUPS_NULL) )%>%
mutate(AGE_GROUP = fct_collapse(AGE_GROUP,
"10 to 19 years"=c("10 to 14 years",
"15 to 19 years"),
"20 to 29 years"=c("20 to 24 years",
"25 to 29 years"),
"30 to 39 years"=c("30 to 34 years",
"35 to 39 years"),
"40 to 49 years"=c("40 to 44 years",
"45 to 49 years"),
"50 to 64 years"=c("50 to 54 years",
"55 to 59 years",
"60 to 64 years"),
"65 years and over"=c("65 to 69 years",
"70 to 74 years",
"75 to 79 years",
"80 to 84 years",
"85 years and over"))) %>%
mutate(AGE_GROUP = str_replace_all(AGE_GROUP," ","_")) %>%
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) %>%
summarise(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") %>%
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") %>%
rename("VALUE"=PERC_SUB_POP)We can indeed check that we have the correct number of levels for VARIABLE using the distinct() function.
# A tibble: 36 x 1
VARIABLE
<chr>
1 Black_Female_10_to_19_years
2 Black_Female_20_to_29_years
3 Black_Female_30_to_39_years
4 Black_Female_40_to_49_years
5 Black_Female_50_to_64_years
6 Black_Female_65_years_and_over
7 Black_Male_10_to_19_years
8 Black_Male_20_to_29_years
9 Black_Male_30_to_39_years
10 Black_Male_40_to_49_years
# … with 26 more rows
We also have population data for each decade that came from the wrangling of the demogrphic data.
We again want to combine this data, so let’s again make sure that all the different tibbles have the same column names.
[1] TRUE
[1] TRUE
[1] TRUE
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1980 Alabama 3899671
2 1980 Alaska 404680
3 1980 Arizona 2735840
4 1980 Arkansas 2288809
5 1980 California 23792840
6 1980 Colorado 2909545
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1990 Alabama 4048508
2 1990 Alaska 553120
3 1990 Arizona 3679056
4 1990 Arkansas 2354343
5 1990 California 29950111
6 1990 Colorado 3303862
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 2000 Alabama 4452173
2 2000 Alaska 627963
3 2000 Arizona 5160586
4 2000 Arkansas 2678588
5 2000 California 33987977
6 2000 Colorado 4326921
Looks good!
population_data <- bind_rows(pop_77_79,
pop_80_89,
pop_90_99,
pop_00_10)
population_data %>%
group_by(YEAR) %>%
tally() %>%
print(n = dim(.)[1])# A tibble: 34 x 2
YEAR n
<dbl> <int>
1 1977 51
2 1978 51
3 1979 51
4 1980 51
5 1981 51
6 1982 51
7 1983 51
8 1984 51
9 1985 51
10 1986 51
11 1987 51
12 1988 51
13 1989 51
14 1990 51
15 1991 51
16 1992 51
17 1993 51
18 1994 51
19 1995 51
20 1996 51
21 1997 51
22 1998 51
23 1999 51
24 2000 51
25 2001 51
26 2002 51
27 2003 51
28 2004 51
29 2005 51
30 2006 51
31 2007 51
32 2008 51
33 2009 51
34 2010 51
[1] "data_year" "ori" "pub_agency_name"
[4] "pub_agency_unit" "state_abbr" "division_name"
[7] "region_name" "county_name" "agency_type_name"
[10] "population_group_desc" "population" "male_officer_ct"
[13] "male_civilian_ct" "male_total_ct" "female_officer_ct"
[16] "female_civilian_ct" "female_total_ct" "officer_ct"
[19] "civilian_ct" "total_pe_ct" "pe_ct_per_1000"
ps_data <- ps_data %>%
filter(data_year >= 1977,
data_year <= 2014) %>%
mutate(male_total_ct = case_when(is.na(male_total_ct) ~ 0,
TRUE ~ male_total_ct),
female_total_ct = case_when(is.na(female_total_ct) ~ 0,
TRUE ~ female_total_ct)) %>%
mutate(officer_total = male_total_ct + female_total_ct) %>%
dplyr::select(data_year,
pub_agency_name,
state_abbr,
officer_total)
ps_data <- ps_data %>%
group_by(data_year, state_abbr) %>%
summarise(officer_state_total=sum(officer_total), .groups = "drop")
ps_data %>%
group_by(state_abbr) %>%
tally() %>%
print(n = dim(.)[1])# A tibble: 59 x 2
state_abbr n
<chr> <int>
1 AK 38
2 AL 38
3 AR 38
4 AS 38
5 AZ 38
6 CA 38
7 CO 38
8 CT 38
9 CZ 38
10 DC 38
11 DE 38
12 FL 38
13 FS 38
14 GA 38
15 GM 38
16 HI 38
17 IA 38
18 ID 38
19 IL 38
20 IN 38
21 KS 38
22 KY 38
23 LA 38
24 MA 38
25 MD 38
26 ME 38
27 MI 38
28 MN 38
29 MO 38
30 MP 38
31 MS 38
32 MT 38
33 NB 38
34 NC 38
35 ND 38
36 NH 38
37 NJ 38
38 NM 38
39 NV 38
40 NY 38
41 OH 38
42 OK 38
43 OR 38
44 OT 38
45 PA 38
46 PR 38
47 RI 38
48 SC 38
49 SD 38
50 TN 38
51 TX 38
52 UT 38
53 VA 38
54 VI 38
55 VT 38
56 WA 38
57 WI 38
58 WV 38
59 WY 38
# NB is Nebraska. This was changed to NE to avoid confusions with NB in Canada. This dataset uses NB
state_of_interest_NULL <- c("AS",
"GM",
"CZ",
"FS",
"MP",
"OT",
"PR",
"VI")
state_abb_df <- as.data.frame(cbind(state.abb, state.name))
colnames(state_abb_df) <- c("state_abbr", "STATE")
print(state_abb_df) state_abbr STATE
1 AL Alabama
2 AK Alaska
3 AZ Arizona
4 AR Arkansas
5 CA California
6 CO Colorado
7 CT Connecticut
8 DE Delaware
9 FL Florida
10 GA Georgia
11 HI Hawaii
12 ID Idaho
13 IL Illinois
14 IN Indiana
15 IA Iowa
16 KS Kansas
17 KY Kentucky
18 LA Louisiana
19 ME Maine
20 MD Maryland
21 MA Massachusetts
22 MI Michigan
23 MN Minnesota
24 MS Mississippi
25 MO Missouri
26 MT Montana
27 NE Nebraska
28 NV Nevada
29 NH New Hampshire
30 NJ New Jersey
31 NM New Mexico
32 NY New York
33 NC North Carolina
34 ND North Dakota
35 OH Ohio
36 OK Oklahoma
37 OR Oregon
38 PA Pennsylvania
39 RI Rhode Island
40 SC South Carolina
41 SD South Dakota
42 TN Tennessee
43 TX Texas
44 UT Utah
45 VT Vermont
46 VA Virginia
47 WA Washington
48 WV West Virginia
49 WI Wisconsin
50 WY Wyoming
state_abb_df <- state_abb_df %>%
add_row(state_abbr="DC",
STATE="District of Columbia")
denominator_temp <- population_data %>%
dplyr::select(-VARIABLE) %>%
rename("Population_temp"=VALUE)
ps_data <- ps_data %>%
filter(!(state_abbr %in% state_of_interest_NULL)) %>%
mutate(state_abbr = case_when(state_abbr == "NB" ~ "NE",
TRUE ~ state_abbr)) %>%
left_join(state_abb_df, by = "state_abbr") %>%
dplyr::select(-state_abbr) %>%
rename(YEAR = "data_year",
VALUE = "officer_state_total") %>%
mutate(VARIABLE = "officer_state_total") %>%
left_join(denominator_temp, by=c("STATE","YEAR")) %>%
mutate(VALUE = (VALUE*100000) / Population_temp) %>%
mutate(VALUE = lag(VALUE)) %>%
mutate(VARIABLE = "police_per_100k_lag") %>%
dplyr::select(-Population_temp)# A tibble: 1 x 14
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2020 3.2 2.9 3 13.3 NA NA NA NA NA NA NA NA
# … with 1 more variable: Annual <dbl>
[1] "STATE" "Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun"
[9] "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" "Annual"
STATE Year Jan Feb Mar Apr
"character" "numeric" "numeric" "numeric" "numeric" "numeric"
May Jun Jul Aug Sep Oct
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
Nov Dec Annual
"numeric" "numeric" "numeric"
# A tibble: 6 x 6
`NOTE: Number in thousa… ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 STATE Total Number "Standard\n… Percent "Standard\ner…
3 Alabama 4877 779 "65" 16 "1.3"
4 Alaska 720 94 "9" 13.1 "1.2"
5 Arizona 7241 929 "80" 12.80000000… "1.1000000000…
6 Arkansas 2912 462 "38" 15.9 "1.3"
colnames(poverty_rate_data) <- c("STATE",
"Total",
"Number",
"Number_se",
"Percent",
"Percent_se")
tail(poverty_rate_data)# A tibble: 6 x 6
STATE Total Number Number_se Percent Percent_se
<chr> <chr> <chr> <chr> <chr> <chr>
1 Wisconsin 4724 403 57 8.5 1.1000000000…
2 Wyoming 468 49 20 10.4 4
3 Standard errors shown in this ta… <NA> <NA> <NA> <NA> <NA>
4 For information on confidentiali… <NA> <NA> <NA> <NA> <NA>
5 Footnotes are available at <www.… <NA> <NA> <NA> <NA> <NA>
6 SOURCE: U.S. Bureau of the Censu… <NA> <NA> <NA> <NA> <NA>
notes <- 4
poverty_rate_data <- poverty_rate_data[-((dim(poverty_rate_data)[1]-notes+1):dim(poverty_rate_data)[1]),]
states_eq <- 51
extra_col <- 2
rep_rows <- states_eq + extra_col
groups <- (dim(poverty_rate_data)[1])/(rep_rows)
paste(groups - (2018-1980 + 1), "extra groups")[1] "2 extra groups"
poverty_rate_data$year_group <- rep(1:groups, each=rep_rows)
poverty_rate_data <- poverty_rate_data %>%
group_by(year_group) %>%
group_split()
head(poverty_rate_data[[1]])# A tibble: 6 x 7
STATE Total Number Number_se Percent Percent_se year_group
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 2018 <NA> <NA> <NA> <NA> <NA> 1
2 STATE Total Number "Standard\ner… Percent "Standard\nerro… 1
3 Alabama 4877 779 "65" 16 "1.3" 1
4 Alaska 720 94 "9" 13.1 "1.2" 1
5 Arizona 7241 929 "80" 12.8000000000… "1.100000000000… 1
6 Arkans… 2912 462 "38" 15.9 "1.3" 1
poverty_rate_data <- poverty_rate_data %>%
map(~mutate(.,
row_id = row_number())) %>%
map(~filter(.,row_id != 2)) %>%
map(~dplyr::select(.,-row_id))
poverty_rate_data_names <- poverty_rate_data %>%
sapply(., "[",1,1, drop=TRUE) %>%
str_replace_all(.,"[:space:]","_")
names(poverty_rate_data) <- poverty_rate_data_names
# Recall 2 extra groups.
# footnotes available at https://www.census.gov/topics/income-poverty/poverty/guidance/poverty-footnotes/cps-historic-footnotes.html
poverty_rate_data$`2017_(21)` <- NULL
poverty_rate_data$`2013_(19)` <- NULL
poverty_rate_data_names <- poverty_rate_data %>%
sapply(., "[",1,1, drop=TRUE) %>%
str_sub(., start = 1, end=4)
names(poverty_rate_data) <- poverty_rate_data_names
poverty_rate_data <- poverty_rate_data %>%
map_df(bind_rows, .id = "YEAR") %>%
dplyr::select(-year_group)
poverty_rate_data <- poverty_rate_data %>%
mutate(n_na = rowSums(is.na(.)))
# This shows that there is systematic missing values stemmingly *solely* from the rows without poverty data and only a label designating the year
poverty_rate_data %>%
group_by(n_na) %>%
tally()# A tibble: 2 x 2
n_na n
<dbl> <int>
1 0 1989
2 5 39
YEAR STATE Total Number Number_se Percent
"character" "character" "character" "character" "character" "character"
Percent_se n_na
"character" "numeric"
poverty_rate_data <- poverty_rate_data %>%
drop_na() %>%
dplyr::select(-Number,
-Number_se,
-Percent_se,
-n_na,
-Total) %>%
rename("VALUE"=Percent) %>%
mutate(VARIABLE = "Poverty_rate",
YEAR = as.numeric(YEAR),
VALUE = as.numeric(VALUE))
colnames(poverty_rate_data)[1] "YEAR" "STATE" "VALUE" "VARIABLE"
https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm
crime_data <- read_lines("docs/Crime/CrimeStatebyState.csv", skip = 2, skip_empty_rows = TRUE)
length(crime_data)[1] 2254
crime_data <- crime_data[-(2143:length(crime_data))]
x <- 2014-1977+1
rep_cycle <- 2 + 2 + x
rep_cycle_cut <- 2 + x
delete_rows <- c(seq(2,length(crime_data),rep_cycle),
seq(3,length(crime_data),rep_cycle))
crime_data <- crime_data[-delete_rows]
crime_data <- data.frame(cbind(crime_data, rep(1:(length(crime_data)/rep_cycle_cut),each=rep_cycle_cut)))
colnames(crime_data) <- c("String","STATE_GROUP")
crime_data <- crime_data %>%
group_by(STATE_GROUP) %>%
group_split()
columns_crime_data <- 8
crime_data <- crime_data %>%
map(~mutate(.,
State = case_when(str_detect(String, "Estimated crime in ") ~ substring(String, nchar("Estimated crime in ")+1)),
row_id = row_number())) %>%
map(~fill(., State)) %>%
map(~filter(.,row_id > 2)) %>%
map(~mutate(.,
String = paste0(String, ",", State))) %>%
map(~dplyr::select(.,String)) %>%
map(~str_split_fixed(.$String,",",columns_crime_data + 1)) %>%
map(~data.frame(.)) %>%
map(~rename(.,"YEAR"=X1,
"Extra_col1"=X2,
"VC"=X3,
"Extra_col2"=X4,
"Extra_col3"=X5,
"Extra_col4"=X6,
"Extra_col5"=X7,
"Extra_col6"=X8,
"STATE"=X9)) %>%
map(~dplyr::select(.,-contains("Extra_col"))) %>%
map(~.x %>% mutate_all(~trimws(.,which = "both"))) %>%
map_df(bind_rows)
sapply(crime_data, class) YEAR VC STATE
"character" "character" "character"
DAWpaper_p_62 <- DAWpaper[[62]]
p_62 <- DAWpaper_p_62 %>%
strsplit("\n") %>%
unlist() %>%
as.data.frame() %>%
slice(-(1:2))
apply(p_62, 1, nchar) [1] 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109
[20] 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109
[39] 109 109 109 109 109 109 109 109 109 109 109 109 109 109 63
[1] " 60"
[1] 3 4 4 4 3 4 2 3 2 4 4 3 4 3 4 4 4 4 4 4 3 2 4 4 4 4 4 4 4 2 3 3 3 3 3 4 4 4
[39] 3 3 2 3 3 4 4 4 3 4 2 3 4 4
[1] 2 3 3 3 2 3 2 2 2 3 3 2 3 2 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 2 2 3 2 3 3 3 3 3
[39] 3 3 2 3 3 3 3 3 2 3 2 3 3 3
[1] 2 2 2 2 1 2 1 1 1 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2
[39] 2 2 1 2 2 2 2 2 2 2 1 2 2 2
[1] 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0
[39] 0 0 1 0 0 0 0 0 1 0 1 0 0 0
.
1 Alabama 1975 1975
2 Alaska 10/1/1994 0.252 1995
3 Arizona 7/17/1994 0.460 1995
4 Arkansas 7/27/1995 0.433 1996
5 California N/A 0
6 Colorado 5/17/2003 0.627 2003
apply(p_62, 1, str_count, "\\\\s{40,}")
1 1
2 0
3 0
4 0
5 1
6 0
p_62 <- p_62 %>%
apply(1,str_replace_all, "\\s{40,}", "|N/A|") %>%
str_replace_all("\\s{2,15}", "|") %>%
as.data.frame()
p_62 <- sapply(p_62$., str_split, "\\|{1,}")
sapply(p_62, nchar)$`|Alabama||1975|N/A|1975`
[1] 0 7 4 3 4
$`|Alaska||10/1/1994||0.252|||1995`
[1] 0 6 9 5 4
$`|Arizona| 7/17/1994||0.460|||1995`
[1] 0 7 10 5 4
$`|Arkansas| 7/27/1995||0.433|||1996`
[1] 0 8 10 5 4
$`|California||N/A|N/A|0`
[1] 0 10 3 3 1
$`|Colorado| 5/17/2003||0.627|||2003`
[1] 0 8 10 5 4
$`|Connecticut||1970|N/A|1970`
[1] 0 11 4 3 4
$`|Delaware||N/A|N/A|0`
[1] 0 8 3 3 1
$`District of Columbia|N/A|N/A|0`
[1] 20 3 3 1
$`|Florida| 10/1/1987||0.252|||1988`
[1] 0 7 10 5 4
$`|Georgia| 8/25/1989||0.353|||1990`
[1] 0 7 10 5 4
$`|Hawaii||N/A|N/A|0`
[1] 0 6 3 3 1
$`|Idaho||7/1/1990||0.504|||1990`
[1] 0 5 8 5 4
$`|Illinois| 1/5/2014|N/A|2014`
[1] 0 8 9 3 4
$`|Indiana| 1/15/1980||0.962|||1980`
[1] 0 7 10 5 4
$`|Iowa||1/1/2011||1.000|||2011`
[1] 0 4 8 5 4
$`|Kansas||1/1/2007||1.000|||2007`
[1] 0 6 8 5 4
$`|Kentucky| 10/1/1996||0.251|||1997`
[1] 0 8 10 5 4
$`|Louisiana|4/19/1996||0.702|||1996`
[1] 0 9 9 5 4
$`|Maine||9/19/1985||0.285|||1986`
[1] 0 5 9 5 4
$`|Maryland||N/A|N/A|0`
[1] 0 8 3 3 1
$`|Massachusetts||N/A|N/A|0`
[1] 0 13 3 3 1
$`|Michigan||7/1/2001||0.504|||2001`
[1] 0 8 8 5 4
$`|Minnesota| 5/28/2003||0.597|||2003`
[1] 0 9 10 5 4
$`|Mississippi|7/1/1990||0.504|||1990`
[1] 0 11 8 5 4
$`|Missouri| 2/26/2004||0.847|||2004`
[1] 0 8 10 5 4
$`|Montana||10/1/1991||0.252|||1992`
[1] 0 7 9 5 4
$`|Nebraska||1/1/2007||1.000|||2007`
[1] 0 8 8 5 4
$`|Nevada||10/1/1995||0.252|||1996`
[1] 0 6 9 5 4
$`|New Hampshire||1959|N/A|1959`
[1] 0 13 4 3 4
$`|New Jersey||N/A|N/A|0`
[1] 0 10 3 3 1
$`|New Mexico||1/1/2004||1.000|||2004`
[1] 0 10 8 5 4
$`|New York||N/A|N/A|0`
[1] 0 8 3 3 1
$`|North Carolina|12/1/1995||0.085|||1996`
[1] 0 14 9 5 4
$`|North Dakota| 8/1/1985||0.419|||1986`
[1] 0 12 9 5 4
$`|Ohio||4/8/2004||0.732|||2004`
[1] 0 4 8 5 4
$`|Oklahoma||1/1/1996||1.000|||1996`
[1] 0 8 8 5 4
$`|Oregon||1/1/1990||1.000|||1990`
[1] 0 6 8 5 4
$`|Pennsylvania|6/17/1989||0.542|||1989`
[1] 0 12 9 5 4
$`|Philadelphia|10/11/1995||0.225|||1996`
[1] 0 12 10 5 4
$`|Rhode Island||N/A|N/A|0`
[1] 0 12 3 3 1
$`|South Carolina|8/23/1996||0.358|||1997`
[1] 0 14 9 5 4
$`|South Dakota| 7/1/1985||0.504|||1985`
[1] 0 12 9 5 4
$`|Tennessee| 10/1/1996||0.251|||1997`
[1] 0 9 10 5 4
$`|Texas||1/1/1996||1.000|||1996`
[1] 0 5 8 5 4
$`|Utah||5/1/1995||0.671|||1995`
[1] 0 4 8 5 4
$`|Vermont||1970|N/A|1970`
[1] 0 7 4 3 4
$`|Virginia| 5/5/1995||0.660|||1995`
[1] 0 8 9 5 4
$`|Washington||1961|N/A|1961`
[1] 0 10 4 3 4
$`|West Virginia|7/7/1989||0.488|||1990`
[1] 0 13 8 5 4
$`|Wisconsin| 11/1/2011||0.167|||2012`
[1] 0 9 10 5 4
$`|Wyoming||10/1/1994||0.252|||1995`
[1] 0 7 9 5 4
p_62 <- lapply(p_62, function(x) x[nchar(x) > 0])
p_62 <- as.data.frame(do.call(rbind, p_62))
rownames(p_62) [1] "|Alabama||1975|N/A|1975"
[2] "|Alaska||10/1/1994||0.252|||1995"
[3] "|Arizona| 7/17/1994||0.460|||1995"
[4] "|Arkansas| 7/27/1995||0.433|||1996"
[5] "|California||N/A|N/A|0"
[6] "|Colorado| 5/17/2003||0.627|||2003"
[7] "|Connecticut||1970|N/A|1970"
[8] "|Delaware||N/A|N/A|0"
[9] "District of Columbia|N/A|N/A|0"
[10] "|Florida| 10/1/1987||0.252|||1988"
[11] "|Georgia| 8/25/1989||0.353|||1990"
[12] "|Hawaii||N/A|N/A|0"
[13] "|Idaho||7/1/1990||0.504|||1990"
[14] "|Illinois| 1/5/2014|N/A|2014"
[15] "|Indiana| 1/15/1980||0.962|||1980"
[16] "|Iowa||1/1/2011||1.000|||2011"
[17] "|Kansas||1/1/2007||1.000|||2007"
[18] "|Kentucky| 10/1/1996||0.251|||1997"
[19] "|Louisiana|4/19/1996||0.702|||1996"
[20] "|Maine||9/19/1985||0.285|||1986"
[21] "|Maryland||N/A|N/A|0"
[22] "|Massachusetts||N/A|N/A|0"
[23] "|Michigan||7/1/2001||0.504|||2001"
[24] "|Minnesota| 5/28/2003||0.597|||2003"
[25] "|Mississippi|7/1/1990||0.504|||1990"
[26] "|Missouri| 2/26/2004||0.847|||2004"
[27] "|Montana||10/1/1991||0.252|||1992"
[28] "|Nebraska||1/1/2007||1.000|||2007"
[29] "|Nevada||10/1/1995||0.252|||1996"
[30] "|New Hampshire||1959|N/A|1959"
[31] "|New Jersey||N/A|N/A|0"
[32] "|New Mexico||1/1/2004||1.000|||2004"
[33] "|New York||N/A|N/A|0"
[34] "|North Carolina|12/1/1995||0.085|||1996"
[35] "|North Dakota| 8/1/1985||0.419|||1986"
[36] "|Ohio||4/8/2004||0.732|||2004"
[37] "|Oklahoma||1/1/1996||1.000|||1996"
[38] "|Oregon||1/1/1990||1.000|||1990"
[39] "|Pennsylvania|6/17/1989||0.542|||1989"
[40] "|Philadelphia|10/11/1995||0.225|||1996"
[41] "|Rhode Island||N/A|N/A|0"
[42] "|South Carolina|8/23/1996||0.358|||1997"
[43] "|South Dakota| 7/1/1985||0.504|||1985"
[44] "|Tennessee| 10/1/1996||0.251|||1997"
[45] "|Texas||1/1/1996||1.000|||1996"
[46] "|Utah||5/1/1995||0.671|||1995"
[47] "|Vermont||1970|N/A|1970"
[48] "|Virginia| 5/5/1995||0.660|||1995"
[49] "|Washington||1961|N/A|1961"
[50] "|West Virginia|7/7/1989||0.488|||1990"
[51] "|Wisconsin| 11/1/2011||0.167|||2012"
[52] "|Wyoming||10/1/1994||0.252|||1995"
rownames(p_62) <- c()
colnames(p_62) <- c("STATE",
"E_Date_RTC",
"Frac_Yr_Eff_Yr_Pass",
"RTC_Date_SA")
sapply(p_62, class) STATE E_Date_RTC Frac_Yr_Eff_Yr_Pass RTC_Date_SA
"character" "character" "character" "character"
p_62 <- p_62 %>%
dplyr::select(STATE, RTC_Date_SA) %>%
rename("RTC_LAW_YEAR"=RTC_Date_SA) %>%
mutate(RTC_LAW_YEAR = as.numeric(RTC_LAW_YEAR)) %>%
mutate(RTC_LAW_YEAR = case_when(RTC_LAW_YEAR == 0 ~ Inf,
TRUE ~ RTC_LAW_YEAR))
sapply(p_62, class) STATE RTC_LAW_YEAR
"character" "numeric"
STATE RTC_LAW_YEAR
1 Alabama 1975
2 Alaska 1995
3 Arizona 1995
4 Arkansas 1996
5 California Inf
6 Colorado 2003
[1] "YEAR" "STATE" "VARIABLE" "VALUE"
[1] "YEAR" "STATE" "VARIABLE" "VALUE"
[1] "STATE" "YEAR" "VALUE" "VARIABLE"
[1] "YEAR" "STATE" "VALUE" "VARIABLE"
[1] "YEAR" "VALUE" "STATE" "VARIABLE"
# A tibble: 6 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Black_Male_15_to_19_years 1.55
2 1977 Alabama Black_Male_20_to_39_years 3.04
3 1977 Alabama Other_Male_15_to_19_years 0.0178
4 1977 Alabama Other_Male_20_to_39_years 0.0642
5 1977 Alabama White_Male_15_to_19_years 3.58
6 1977 Alabama White_Male_20_to_39_years 11.1
# A tibble: 6 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Black_Female_10_to_19_years 3.01
2 1977 Alabama Black_Female_20_to_29_years 2.33
3 1977 Alabama Black_Female_30_to_39_years 1.29
4 1977 Alabama Black_Female_40_to_49_years 1.18
5 1977 Alabama Black_Female_50_to_64_years 1.73
6 1977 Alabama Black_Female_65_years_and_over 1.58
# A tibble: 6 x 4
STATE YEAR VALUE VARIABLE
<chr> <dbl> <dbl> <chr>
1 Alabama 1977 7.3 Unemployment_rate
2 Alabama 1978 6.4 Unemployment_rate
3 Alabama 1979 7.2 Unemployment_rate
4 Alabama 1980 8.9 Unemployment_rate
5 Alabama 1981 10.6 Unemployment_rate
6 Alabama 1982 14.1 Unemployment_rate
# A tibble: 6 x 4
YEAR STATE VALUE VARIABLE
<dbl> <chr> <dbl> <chr>
1 2018 Alabama 16 Poverty_rate
2 2018 Alaska 13.1 Poverty_rate
3 2018 Arizona 12.8 Poverty_rate
4 2018 Arkansas 15.9 Poverty_rate
5 2018 California 11.9 Poverty_rate
6 2018 Colorado 9.1 Poverty_rate
# A tibble: 6 x 4
YEAR VALUE STATE VARIABLE
<dbl> <dbl> <chr> <chr>
1 1977 15293 Alabama Viol_crime_count
2 1978 15682 Alabama Viol_crime_count
3 1979 15578 Alabama Viol_crime_count
4 1980 17320 Alabama Viol_crime_count
5 1981 18423 Alabama Viol_crime_count
6 1982 17653 Alabama Viol_crime_count
DONOHUE_DF <- bind_rows(dem_DONOHUE,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data) %>%
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE") %>%
left_join(p_62 , by = c("STATE")) %>%
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE))
DONOHUE_DF %>%
group_by(YEAR) %>%
tally() %>%
filter(n != 51) %>%
print(n=dim(.)[1])# A tibble: 33 x 2
YEAR n
<dbl> <int>
1 1980 52
2 1981 52
3 1982 52
4 1983 52
5 1984 52
6 1985 52
7 1986 52
8 1987 52
9 1988 52
10 1989 52
11 1990 52
12 1991 52
13 1992 52
14 1993 52
15 1994 52
16 1995 52
17 1996 52
18 1997 52
19 1998 52
20 1999 52
21 2000 52
22 2001 52
23 2002 52
24 2003 52
25 2004 52
26 2005 52
27 2006 52
28 2007 52
29 2008 52
30 2009 52
31 2010 52
32 2011 52
33 2012 52
Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut D.C. Delaware
44 33 44
District of Columbia Florida Georgia
44 44 44
Hawaii Idaho Illinois
44 44 44
Indiana Iowa Kansas
44 44 44
Kentucky Louisiana Maine
44 44 44
Maryland Massachusetts Michigan
44 44 44
Minnesota Mississippi Missouri
44 44 44
Montana Nebraska Nevada
44 44 44
New Hampshire New Jersey New Mexico
44 44 44
New York North Carolina North Dakota
44 44 44
Ohio Oklahoma Oregon
44 44 44
Pennsylvania Rhode Island South Carolina
44 44 44
South Dakota Tennessee Texas
44 44 44
Utah Vermont Virginia
44 44 44
Washington West Virginia Wisconsin
44 44 44
Wyoming
44
[1] 44
DONOHUE_DF <- DONOHUE_DF %>%
mutate(STATE = fct_collapse(STATE, "District of Columbia"=c("District of Columbia","D.C.")))
summary(as.factor(DONOHUE_DF$STATE)) Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut District of Columbia Delaware
44 77 44
Florida Georgia Hawaii
44 44 44
Idaho Illinois Indiana
44 44 44
Iowa Kansas Kentucky
44 44 44
Louisiana Maine Maryland
44 44 44
Massachusetts Michigan Minnesota
44 44 44
Mississippi Missouri Montana
44 44 44
Nebraska Nevada New Hampshire
44 44 44
New Jersey New Mexico New York
44 44 44
North Carolina North Dakota Ohio
44 44 44
Oklahoma Oregon Pennsylvania
44 44 44
Rhode Island South Carolina South Dakota
44 44 44
Tennessee Texas Utah
44 44 44
Vermont Virginia Washington
44 44 44
West Virginia Wisconsin Wyoming
44 44 44
[1] 51
DONOHUE_DF <- DONOHUE_DF %>%
group_by(STATE, YEAR) %>%
summarise_all(~na.omit(unique(.))) %>%
ungroup() # This identifies unique observations, coalesces rows according to the grouping variable(s), and gets rid of of units that have incomplete data. This gives returns a dataframe with the most complete information.
summary(as.factor(DONOHUE_DF$STATE)) Alabama Alaska Arizona
31 31 31
Arkansas California Colorado
31 31 31
Connecticut District of Columbia Delaware
31 31 31
Florida Georgia Hawaii
31 31 31
Idaho Illinois Indiana
31 31 31
Iowa Kansas Kentucky
31 31 31
Louisiana Maine Maryland
31 31 31
Massachusetts Michigan Minnesota
31 31 31
Mississippi Missouri Montana
31 31 31
Nebraska Nevada New Hampshire
31 31 31
New Jersey New Mexico New York
31 31 31
North Carolina North Dakota Ohio
31 31 31
Oklahoma Oregon Pennsylvania
31 31 31
Rhode Island South Carolina South Dakota
31 31 31
Tennessee Texas Utah
31 31 31
Vermont Virginia Washington
31 31 31
West Virginia Wisconsin Wyoming
31 31 31
baseline_year <- min(DONOHUE_DF$YEAR)
censoring_year <- max(DONOHUE_DF$YEAR)
# Need to fix this to ensure severe bias is not introduced by prevalent "cases"
DONOHUE_DF <- DONOHUE_DF %>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
DONOHUE_DF <- DONOHUE_DF %>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
summary(droplevels(as.factor(DONOHUE_DF$STATE))) Alaska Arizona Arkansas
31 31 31
California Colorado District of Columbia
31 31 31
Delaware Florida Georgia
31 31 31
Hawaii Idaho Illinois
31 31 31
Iowa Kansas Kentucky
31 31 31
Louisiana Maine Maryland
31 31 31
Massachusetts Michigan Minnesota
31 31 31
Mississippi Missouri Montana
31 31 31
Nebraska Nevada New Jersey
31 31 31
New Mexico New York North Carolina
31 31 31
North Dakota Ohio Oklahoma
31 31 31
Oregon Pennsylvania Rhode Island
31 31 31
South Carolina South Dakota Tennessee
31 31 31
Texas Utah Virginia
31 31 31
West Virginia Wisconsin Wyoming
31 31 31
[1] 45
LOTT_DF <- bind_rows(dem_LOTT,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data) %>%
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE") %>%
left_join(p_62 , by = c("STATE")) %>%
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE))
LOTT_DF %>%
group_by(YEAR) %>%
tally() %>%
filter(n != 51) %>%
print(n=dim(.)[1])# A tibble: 33 x 2
YEAR n
<dbl> <int>
1 1980 52
2 1981 52
3 1982 52
4 1983 52
5 1984 52
6 1985 52
7 1986 52
8 1987 52
9 1988 52
10 1989 52
11 1990 52
12 1991 52
13 1992 52
14 1993 52
15 1994 52
16 1995 52
17 1996 52
18 1997 52
19 1998 52
20 1999 52
21 2000 52
22 2001 52
23 2002 52
24 2003 52
25 2004 52
26 2005 52
27 2006 52
28 2007 52
29 2008 52
30 2009 52
31 2010 52
32 2011 52
33 2012 52
Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut D.C. Delaware
44 33 44
District of Columbia Florida Georgia
44 44 44
Hawaii Idaho Illinois
44 44 44
Indiana Iowa Kansas
44 44 44
Kentucky Louisiana Maine
44 44 44
Maryland Massachusetts Michigan
44 44 44
Minnesota Mississippi Missouri
44 44 44
Montana Nebraska Nevada
44 44 44
New Hampshire New Jersey New Mexico
44 44 44
New York North Carolina North Dakota
44 44 44
Ohio Oklahoma Oregon
44 44 44
Pennsylvania Rhode Island South Carolina
44 44 44
South Dakota Tennessee Texas
44 44 44
Utah Vermont Virginia
44 44 44
Washington West Virginia Wisconsin
44 44 44
Wyoming
44
[1] 44
LOTT_DF <- LOTT_DF %>%
mutate(STATE = fct_collapse(STATE, "District of Columbia"=c("District of Columbia","D.C.")))
summary(as.factor(LOTT_DF$STATE)) Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut District of Columbia Delaware
44 77 44
Florida Georgia Hawaii
44 44 44
Idaho Illinois Indiana
44 44 44
Iowa Kansas Kentucky
44 44 44
Louisiana Maine Maryland
44 44 44
Massachusetts Michigan Minnesota
44 44 44
Mississippi Missouri Montana
44 44 44
Nebraska Nevada New Hampshire
44 44 44
New Jersey New Mexico New York
44 44 44
North Carolina North Dakota Ohio
44 44 44
Oklahoma Oregon Pennsylvania
44 44 44
Rhode Island South Carolina South Dakota
44 44 44
Tennessee Texas Utah
44 44 44
Vermont Virginia Washington
44 44 44
West Virginia Wisconsin Wyoming
44 44 44
[1] 51
LOTT_DF <- LOTT_DF %>%
group_by(STATE, YEAR) %>%
summarise_all(~na.omit(unique(.))) %>%
ungroup() # This identifies unique observations, coalesces rows according to the grouping variable(s), and gets rid of of units that have incomplete data. This gives returns a dataframe with the most complete information.
summary(as.factor(LOTT_DF$STATE)) Alabama Alaska Arizona
31 31 31
Arkansas California Colorado
31 31 31
Connecticut District of Columbia Delaware
31 31 31
Florida Georgia Hawaii
31 31 31
Idaho Illinois Indiana
31 31 31
Iowa Kansas Kentucky
31 31 31
Louisiana Maine Maryland
31 31 31
Massachusetts Michigan Minnesota
31 31 31
Mississippi Missouri Montana
31 31 31
Nebraska Nevada New Hampshire
31 31 31
New Jersey New Mexico New York
31 31 31
North Carolina North Dakota Ohio
31 31 31
Oklahoma Oregon Pennsylvania
31 31 31
Rhode Island South Carolina South Dakota
31 31 31
Tennessee Texas Utah
31 31 31
Vermont Virginia Washington
31 31 31
West Virginia Wisconsin Wyoming
31 31 31
baseline_year <- min(LOTT_DF$YEAR)
censoring_year <- max(LOTT_DF$YEAR)
# Need to fix this to ensure severe bias is not introduced by prevalent "cases"
LOTT_DF <- LOTT_DF %>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
LOTT_DF <- LOTT_DF %>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
summary(droplevels(as.factor(LOTT_DF$STATE))) Alaska Arizona Arkansas
31 31 31
California Colorado District of Columbia
31 31 31
Delaware Florida Georgia
31 31 31
Hawaii Idaho Illinois
31 31 31
Iowa Kansas Kentucky
31 31 31
Louisiana Maine Maryland
31 31 31
Massachusetts Michigan Minnesota
31 31 31
Mississippi Missouri Montana
31 31 31
Nebraska Nevada New Jersey
31 31 31
New Mexico New York North Carolina
31 31 31
North Dakota Ohio Oklahoma
31 31 31
Oregon Pennsylvania Rhode Island
31 31 31
South Carolina South Dakota Tennessee
31 31 31
Texas Utah Virginia
31 31 31
West Virginia Wisconsin Wyoming
31 31 31
[1] 45
STATE YEAR Black_Male_15_to_19_years
"factor" "numeric" "numeric"
Black_Male_20_to_39_years Other_Male_15_to_19_years Other_Male_20_to_39_years
"numeric" "numeric" "numeric"
White_Male_15_to_19_years White_Male_20_to_39_years Unemployment_rate
"numeric" "numeric" "numeric"
Poverty_rate Viol_crime_count Population
"numeric" "numeric" "numeric"
police_per_100k_lag RTC_LAW_YEAR RTC_LAW
"numeric" "numeric" "logical"
TIME_0 TIME_INF Viol_crime_rate_1k
"numeric" "numeric" "numeric"
Viol_crime_rate_1k_log Population_log
"numeric" "numeric"
DONOHUE_DF %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log, color = STATE)) +
geom_point(size = 0.5) +
geom_line(aes(group=STATE),
size = 0.5,
show.legend = FALSE) +
geom_text_repel(data = DONOHUE_DF %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
filter(YEAR == last(YEAR)),
aes(label = STATE,
x = YEAR,
y = Viol_crime_rate_100k_log),
size = 3,
alpha = 1,
nudge_x = 10,
direction = "y",
hjust = 1,
vjust = 1,
segment.size = 0.25,
segment.alpha = 0.25,
force = 1,
max.iter = 9999) +
guides(color = FALSE) +
scale_x_continuous(breaks = seq(1980, 2015, by = 1),
limits = c(1980, 2015),
labels = c(seq(1980, 2010, by = 1), rep("", 5))) +
scale_y_continuous(breaks = seq(3.5, 8.5, by = 0.5),
limits = c(3.5, 8.5)) +
labs(title = "States have different levels of crime",
x = "Year",
y = "ln(violent crimes per 100,000 people)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))DONOHUE_DF %>%
group_by(YEAR) %>%
summarise(Viol_crime_count = sum(Viol_crime_count),
Population = sum(Population),
.groups = "drop") %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log)) +
geom_line() +
scale_x_continuous(breaks = seq(1980, 2010, by = 1),
limits = c(1980, 2010),
labels = c(seq(1980, 2010, by = 1))) +
scale_y_continuous(breaks = seq(5.75, 6.75, by = 0.25),
limits = c(5.75, 6.75)) +
labs(title = "Crime rates fluctuate over time",
x = "Year",
y = "ln(violent crimes per 100,000 people)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))Some code taken from http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
d_panel_DONOHUE <- pdata.frame(DONOHUE_DF, index=c("STATE", "YEAR"))
DONOHUE_OUTPUT <- plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
effect = "twoways",
model = "within",
data=d_panel_DONOHUE)
summary(DONOHUE_OUTPUT)Twoways effects Within Model
Call:
plm(formula = Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years +
White_Male_20_to_39_years + Black_Male_15_to_19_years + Black_Male_20_to_39_years +
Other_Male_15_to_19_years + Other_Male_20_to_39_years + Unemployment_rate +
Poverty_rate + Population_log + police_per_100k_lag, data = d_panel_DONOHUE,
effect = "twoways", model = "within")
Balanced Panel: n = 45, T = 31, N = 1395
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.57957437 -0.08942194 -0.00090654 0.08673054 1.11216999
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE 0.01796779 0.01663911 1.0799 0.2804066
White_Male_15_to_19_years -0.00091825 0.02724210 -0.0337 0.9731160
White_Male_20_to_39_years 0.03466473 0.00972839 3.5633 0.0003794 ***
Black_Male_15_to_19_years -0.05673593 0.05746052 -0.9874 0.3236341
Black_Male_20_to_39_years 0.12605439 0.01931450 6.5264 9.607e-11 ***
Other_Male_15_to_19_years 0.69201638 0.11322394 6.1119 1.297e-09 ***
Other_Male_20_to_39_years -0.30276797 0.03811855 -7.9428 4.226e-15 ***
Unemployment_rate -0.01685806 0.00489952 -3.4408 0.0005984 ***
Poverty_rate -0.00780235 0.00295720 -2.6384 0.0084280 **
Population_log -0.17991653 0.06041773 -2.9779 0.0029559 **
police_per_100k_lag 0.00060391 0.00013689 4.4115 1.111e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 43.211
Residual Sum of Squares: 36.716
R-Squared: 0.15031
Adj. R-Squared: 0.095138
F-statistic: 21.0514 on 11 and 1309 DF, p-value: < 2.22e-16
Some code taken from http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
LOTT_variables <- LOTT_DF %>%
dplyr::select(RTC_LAW,
contains(c("White","Black","Other")),
Unemployment_rate,
Poverty_rate,
Population_log,
police_per_100k_lag) %>%
colnames()
LOTT_fmla <- as.formula(paste("Viol_crime_rate_1k_log ~",
paste(LOTT_variables, collapse = " + ")
)
)
d_panel_LOTT <- pdata.frame(LOTT_DF, index=c("STATE", "YEAR"))
LOTT_OUTPUT <- plm(LOTT_fmla,
model = "within",
effect = "twoways",
data=d_panel_LOTT)
summary(LOTT_OUTPUT)Twoways effects Within Model
Call:
plm(formula = LOTT_fmla, data = d_panel_LOTT, effect = "twoways",
model = "within")
Balanced Panel: n = 45, T = 31, N = 1395
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.5448906 -0.0780395 0.0026738 0.0788052 0.5847263
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE -0.04687169 0.01641851 -2.8548 0.0043758 **
White_Female_10_to_19_years 0.62441376 0.15103427 4.1343 3.793e-05 ***
White_Female_20_to_29_years -0.05942541 0.06332108 -0.9385 0.3481763
White_Female_30_to_39_years 0.16028113 0.08045953 1.9921 0.0465755 *
White_Female_40_to_49_years 0.10087510 0.08170707 1.2346 0.2172082
White_Female_50_to_64_years -0.37624966 0.06303172 -5.9692 3.083e-09 ***
White_Female_65_years_and_over 0.20636690 0.04742430 4.3515 1.460e-05 ***
White_Male_10_to_19_years -0.59141591 0.14436974 -4.0965 4.457e-05 ***
White_Male_20_to_29_years 0.08717546 0.05862342 1.4870 0.1372503
White_Male_30_to_39_years -0.12514225 0.08588569 -1.4571 0.1453400
White_Male_40_to_49_years -0.21812366 0.07293615 -2.9906 0.0028375 **
White_Male_50_to_64_years 0.37845575 0.07314122 5.1743 2.653e-07 ***
White_Male_65_years_and_over -0.20915907 0.06659815 -3.1406 0.0017246 **
Black_Female_10_to_19_years -1.03146594 0.43610403 -2.3652 0.0181697 *
Black_Female_20_to_29_years -0.02721685 0.17462559 -0.1559 0.8761693
Black_Female_30_to_39_years -0.03246043 0.20498789 -0.1584 0.8742037
Black_Female_40_to_49_years 0.43820099 0.23524130 1.8628 0.0627234 .
Black_Female_50_to_64_years 0.04906111 0.21393128 0.2293 0.8186482
Black_Female_65_years_and_over 0.07226074 0.24373031 0.2965 0.7669130
Black_Male_10_to_19_years 1.22536162 0.44559642 2.7499 0.0060447 **
Black_Male_20_to_29_years -0.06587312 0.18392655 -0.3581 0.7202909
Black_Male_30_to_39_years 0.24720746 0.23673862 1.0442 0.2965804
Black_Male_40_to_49_years -0.66869983 0.27173041 -2.4609 0.0139904 *
Black_Male_50_to_64_years -0.16737616 0.23977741 -0.6980 0.4852740
Black_Male_65_years_and_over -0.58743446 0.34691532 -1.6933 0.0906404 .
Other_Female_10_to_19_years 0.70957924 0.49539878 1.4323 0.1522910
Other_Female_20_to_29_years -1.16489945 0.26997487 -4.3148 1.720e-05 ***
Other_Female_30_to_39_years -3.40258912 0.35368437 -9.6204 < 2.2e-16 ***
Other_Female_40_to_49_years 1.34563633 0.42503994 3.1659 0.0015825 **
Other_Female_50_to_64_years 2.93990932 0.33830653 8.6901 < 2.2e-16 ***
Other_Female_65_years_and_over 2.36026239 0.20422580 11.5571 < 2.2e-16 ***
Other_Male_10_to_19_years 0.07481449 0.47835310 0.1564 0.8757423
Other_Male_20_to_29_years 1.62895925 0.25740603 6.3284 3.420e-10 ***
Other_Male_30_to_39_years 3.17421278 0.41184489 7.7073 2.566e-14 ***
Other_Male_40_to_49_years -1.58494177 0.44840281 -3.5346 0.0004229 ***
Other_Male_50_to_64_years -3.91523867 0.37399898 -10.4686 < 2.2e-16 ***
Other_Male_65_years_and_over -4.16596244 0.36860536 -11.3020 < 2.2e-16 ***
Unemployment_rate -0.00545734 0.00436374 -1.2506 0.2113054
Poverty_rate -0.00572362 0.00253162 -2.2609 0.0239357 *
Population_log -0.21716335 0.08452664 -2.5692 0.0103068 *
police_per_100k_lag 0.00069547 0.00013331 5.2171 2.118e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 43.211
Residual Sum of Squares: 23.647
R-Squared: 0.45275
Adj. R-Squared: 0.40355
F-statistic: 25.8088 on 41 and 1279 DF, p-value: < 2.22e-16
comparing_analyses <- DONOHUE_OUTPUT_TIDY %>%
bind_rows(LOTT_OUTPUT_TIDY) %>%
filter(term == "RTC_LAWTRUE")
library(grid)
comparing_analyses_plot <- ggplot(comparing_analyses) +
geom_point(aes(x = Analysis, y = estimate)) +
geom_errorbar(aes(x = Analysis, ymin = conf.low, ymax = conf.high), width = 0.25) +
geom_hline(yintercept = 0, color = "red") +
scale_y_continuous(breaks = seq(-0.2, 0.2, by = 0.05),
labels = seq(-0.2, 0.2, by = 0.05),
limits = c(-0.2,0.2)) +
geom_segment(aes(x = 1, y = 0.125, xend = 1, yend = 0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2,
color = "green",
lineend = "butt",
linejoin = "mitre") +
geom_segment(aes(x = 2, y = -0.125, xend = 2, yend = -0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2,
color = "red",
lineend = "butt",
linejoin = "mitre") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text = element_text(size = 12)) +
labs(title = "Effect estimate on ln(violent crimes per 100,000 people)",
y = "Effect estimate (95% CI)")
comparing_analyses_plotHow did the above happen?
The analysis dataframes are very similar yet rendered very different results.
- different number of columns: 20 vs 50
[1] TRUE
The only difference between the two dataframes rests in how the demographic variables were parameterized.
[1] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[3] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[5] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[1] "Black_Female_10_to_19_years" "Black_Female_20_to_29_years"
[3] "Black_Female_30_to_39_years" "Black_Female_40_to_49_years"
[5] "Black_Female_50_to_64_years" "Black_Female_65_years_and_over"
[7] "Black_Male_10_to_19_years" "Black_Male_20_to_29_years"
[9] "Black_Male_30_to_39_years" "Black_Male_40_to_49_years"
[11] "Black_Male_50_to_64_years" "Black_Male_65_years_and_over"
[13] "Other_Female_10_to_19_years" "Other_Female_20_to_29_years"
[15] "Other_Female_30_to_39_years" "Other_Female_40_to_49_years"
[17] "Other_Female_50_to_64_years" "Other_Female_65_years_and_over"
[19] "Other_Male_10_to_19_years" "Other_Male_20_to_29_years"
[21] "Other_Male_30_to_39_years" "Other_Male_40_to_49_years"
[23] "Other_Male_50_to_64_years" "Other_Male_65_years_and_over"
[25] "White_Female_10_to_19_years" "White_Female_20_to_29_years"
[27] "White_Female_30_to_39_years" "White_Female_40_to_49_years"
[29] "White_Female_50_to_64_years" "White_Female_65_years_and_over"
[31] "White_Male_10_to_19_years" "White_Male_20_to_29_years"
[33] "White_Male_30_to_39_years" "White_Male_40_to_49_years"
[35] "White_Male_50_to_64_years" "White_Male_65_years_and_over"
Clearly, this had an effect on the results of the analysis.
Let’s explore how this occured.
When seemingly independent variables are highly related to one another, the relationships estimated in an analysis may be distorted.
In regression analysis, this distortion is often a byproduct of a violation of the independence assumption. This distortion, if large enough, can impact statistical inference.
There are several ways we can diagnose multicollinearity.
Again, multicollinearity often occurs when independent variables are highly related to one another. Consequently, we can evaluate these relationships be examining the correlation between variable pairs.
It is important to note that multicollinearity and correlation are not one and the same. Correlation can be thought of as the strength of the relationship between variables. On the other hand, multicollinearity can be thought of the the violation of the independence assumption that is a consequence of this correlation in a regression analysis.
[1] "STATE" "YEAR"
[3] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[5] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[7] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[9] "Unemployment_rate" "Poverty_rate"
[11] "Viol_crime_count" "Population"
[13] "police_per_100k_lag" "RTC_LAW_YEAR"
[15] "RTC_LAW" "TIME_0"
[17] "TIME_INF" "Viol_crime_rate_1k"
[19] "Viol_crime_rate_1k_log" "Population_log"
DONOHUE_DF %>%
dplyr::select(RTC_LAW,
Viol_crime_rate_1k_log,
Unemployment_rate,
Poverty_rate,
Population_log) %>%
ggpairs(.,
columns = c(2:5),
lower = list(continuous = wrap("smooth_loess",
color = "red",
alpha = 0.5,
size = 0.1)))LOTT_DF %>%
dplyr::select(RTC_LAW,
Viol_crime_rate_1k_log,
Unemployment_rate,
Poverty_rate,
Population_log) %>%
ggpairs(.,
columns = c(2:5),
lower = list(continuous = wrap("smooth_loess",
color = "red",
alpha = 0.5,
size = 0.1)))cor_DONOHUE_dem <- cor(DONOHUE_DF %>% dplyr::select(contains("_years")))
corr_mat_DONOHUE <- ggcorrplot(cor_DONOHUE_dem,
tl.cex = 6,
hc.order = TRUE,
colors = c("red",
"white",
"red"),
outline.color = "transparent",
title = "Correlation Matrix, Analysis 1",
legend.title = TeX("$\\rho$"))
corr_mat_DONOHUEcor_LOTT_dem <- cor(LOTT_DF %>% dplyr::select(contains("_years")))
corr_mat_LOTT <- ggcorrplot(cor_LOTT_dem,
tl.cex = 6,
hc.order = TRUE,
colors = c("red",
"white",
"red"),
outline.color = "transparent",
title = "Correlation Matrix, Analysis 2",
legend.title = TeX("$\\rho$"))
corr_mat_LOTTsims <- 250
# DONOHUE
# round(dim(DONOHUE_DF)[1]/2)
samps_DONOHUE <- lapply(rep(dim(DONOHUE_DF)[1]-1, sims),
function(x)DONOHUE_DF[sample(nrow(DONOHUE_DF),
size = x, replace = FALSE),])
fit_nls_on_bootstrap_DONOHUE <- function(split){
plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = data.frame(split),
index = c("STATE","YEAR"),
model = "within",
effect = "twoways")
}
samps_models_DONOHUE <- lapply(samps_DONOHUE, fit_nls_on_bootstrap_DONOHUE)
samps_models_DONOHUE <- samps_models_DONOHUE %>%
map(tidy)
names(samps_models_DONOHUE) <- paste0("DONOHUE_",1:length(samps_models_DONOHUE))
simulations_DONOHUE <- samps_models_DONOHUE %>%
bind_rows(.id = "ID") %>%
mutate(Analysis = "Analysis 1")
## LOTT
samps_LOTT <- lapply(rep(round(dim(LOTT_DF)[1]/2), sims),
function(x) LOTT_DF[sample(nrow(LOTT_DF),
size = x, replace = FALSE),])
fit_nls_on_bootstrap_LOTT <- function(split){
plm(LOTT_fmla,
data = data.frame(split),
index = c("STATE","YEAR"),
model = "within",
effect = "twoways")
}
samps_models_LOTT <- lapply(samps_LOTT, fit_nls_on_bootstrap_LOTT)
samps_models_LOTT <- samps_models_LOTT %>%
map(tidy)
names(samps_models_LOTT) <- paste0("LOTT_",1:length(samps_models_LOTT))
simulations_LOTT <- samps_models_LOTT %>%
bind_rows(.id = "Analysis") %>%
mutate(Analysis = "Analysis 2")
simulations <- bind_rows(simulations_DONOHUE,
simulations_LOTT)
simulation_plot <- simulations %>%
filter(term=="RTC_LAWTRUE") %>%
ggplot(aes(x = Analysis, y = estimate)) +
geom_jitter(alpha = 0.25,
width = 0.1) +
labs(title = "Coefficient instability",
subtitle = "Estimates sensitive to observation deletions",
x = "Term",
y = "Coefficient",
caption = "Results from simulations") +
theme_minimal() +
theme(axis.title.x = element_blank())
simulation_plotdesign.matrix <- as.data.frame(model.matrix(DONOHUE_OUTPUT))
design.matrix$Viol_crime_rate_1k_log <- plm::Within(
d_panel_DONOHUE$Viol_crime_rate_1k_log)
lm_DONOHUE <- lm(Viol_crime_rate_1k_log ~
RTC_LAWTRUE + # logical class changes variable name after inital model
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = design.matrix)
vif(lm_DONOHUE) RTC_LAWTRUE White_Male_15_to_19_years White_Male_20_to_39_years
1.097853 1.172339 1.738459
Black_Male_15_to_19_years Black_Male_20_to_39_years Other_Male_15_to_19_years
1.344193 1.653712 1.586648
Other_Male_20_to_39_years Unemployment_rate Poverty_rate
1.529688 1.244667 1.270321
Population_log police_per_100k_lag
1.153933 1.204491
vif_DONOHUE <- vif(lm_DONOHUE)
vif_DONOHUE <- vif_DONOHUE %>%
as_tibble() %>%
cbind(., names(vif_DONOHUE)) %>%
as_tibble()
colnames(vif_DONOHUE) <- c("VIF", "Variable")
max_vif_DONOHUE <- max(vif(lm_DONOHUE)) design.matrix <- as.data.frame(model.matrix(LOTT_OUTPUT))
design.matrix$Viol_crime_rate_1k_log <- plm::Within(
d_panel_LOTT$Viol_crime_rate_1k_log)
LOTT_variables_ols <- LOTT_DF %>%
dplyr::select(RTC_LAW,
contains(c("White","Black","Other")),
Unemployment_rate,
Poverty_rate,
Population_log,
police_per_100k_lag) %>%
colnames() %>%
str_replace("RTC_LAW", "RTC_LAWTRUE") # logical class changes variable name after inital model
LOTT_fmla_ols <- as.formula(paste("Viol_crime_rate_1k_log ~",
paste(LOTT_variables_ols, collapse = " + ")
)
)
lm_LOTT <- lm(LOTT_fmla_ols,
data = design.matrix)
vif(lm_LOTT) RTC_LAWTRUE White_Female_10_to_19_years
1.621662 127.920555
White_Female_20_to_29_years White_Female_30_to_39_years
42.269637 49.635014
White_Female_40_to_49_years White_Female_50_to_64_years
37.550101 36.451868
White_Female_65_years_and_over White_Male_10_to_19_years
12.866751 126.824984
White_Male_20_to_29_years White_Male_30_to_39_years
39.248785 73.008959
White_Male_40_to_49_years White_Male_50_to_64_years
31.613855 52.774694
White_Male_65_years_and_over Black_Female_10_to_19_years
13.285326 335.136906
Black_Female_20_to_29_years Black_Female_30_to_39_years
106.644486 79.058455
Black_Female_40_to_49_years Black_Female_50_to_64_years
98.434064 66.888057
Black_Female_65_years_and_over Black_Male_10_to_19_years
49.715869 320.740453
Black_Male_20_to_29_years Black_Male_30_to_39_years
89.297151 89.267356
Black_Male_40_to_49_years Black_Male_50_to_64_years
92.498486 64.538516
Black_Male_65_years_and_over Other_Female_10_to_19_years
37.960126 142.283700
Other_Female_20_to_29_years Other_Female_30_to_39_years
64.966861 54.511835
Other_Female_40_to_49_years Other_Female_50_to_64_years
224.567085 131.463113
Other_Female_65_years_and_over Other_Male_10_to_19_years
82.394398 151.930450
Other_Male_20_to_29_years Other_Male_30_to_39_years
54.620045 62.267344
Other_Male_40_to_49_years Other_Male_50_to_64_years
244.698473 174.184553
Other_Male_65_years_and_over Unemployment_rate
53.532299 1.497864
Poverty_rate Population_log
1.412397 3.426475
police_per_100k_lag
1.732745
vif_LOTT <- vif(lm_LOTT)
vif_LOTT <- vif_LOTT %>%
as_tibble() %>%
cbind(., names(vif_LOTT)) %>%
as_tibble()
colnames(vif_LOTT) <- c("VIF", "Variable")
max_vif_LOTT <- max(vif(lm_LOTT))[1] 1.738459
[1] 335.1369
\[\frac{1}{1-R_{i}^{2}}\]
vif_DONOHUE$Analysis <- "Analysis 1"
vif_LOTT$Analysis <- "Analysis 2"
vif_df <- rbind(vif_DONOHUE,
vif_LOTT)
vif_plot <- vif_df %>%
ggplot(aes(x = Analysis, y = VIF)) +
geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
geom_hline(yintercept = 10, color = "red") +
scale_y_continuous(trans = 'log10',
limits = c(1,1000)) +
labs(title = "Variance inflation factors") +
theme_minimal() +
theme(axis.title.x = element_blank())
vif_plothttps://rpubs.com/rslbliss/fixed_effects
http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
https://stats.stackexchange.com/questions/99236/effects-in-panel-models-individual-time-or-twoways